Reiner Pope
๐ค SpeakerAppearances Over Time
Podcast Appearances
Thank you.
So what we would like is for the scale-up time to be greater than the scale-out time, because the scale-up time is the more important and precious resource.
And so we would like this number to be greater than or equal to 1.
And this really doesn't seem hard.
There's just a factor of 8 that we need to overcome.
So we need the product of these three things to be bigger than 8.
Typically, we have a fairly large number of activated experts.
It could be eight by itself.
And then we can increase the number of layers per stage a lot until we satisfy this.
I see.
So what this ends up looking like is that I can, in fact, have an entire pipeline of racks where one rack does one layer, and then I move on to the next rack, and I do another layer, and then I move on to the next rack.
I can do another layer.
Isn't that, I feel that's interesting that the physical and... The model architecture matches, like the cutting matches the model architecture.
Yeah, exactly.
Yeah.
Yeah.
So, I mean, I think a way to think of it is, I mean, okay, the galaxy brain way to think of it is...
Like, what are all the different dimensions in which a model is scaled up?
And so it is scaled up by layers, it is scaled up by the demodeled dimension, it is scaled up by the DFF dimension, it is scaled up by the number of experts.
Every single one of those numbers you can choose to cut along.