Reiner Pope
And then we're also maybe doing it many times.
So that's going to be what makes the difference.
Yeah, so number of activated GPUs, right?
So, like, I don't send to this GPU at all, right?
So there's an explosion from one to, like, three times larger here in this diagram.
Yeah.
The key thing is that I didn't even need to send to this GPU at all, and so that's a big saving.
I see, yeah.
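(A minimal sketch of the saving just described, with a hypothetical expert-to-GPU layout and routing decision; the point is that a token only needs to be sent to GPUs hosting at least one of its activated experts.)

```python
# Hypothetical illustration: 16 experts spread over 4 GPUs, 4 experts each.
experts_per_gpu = 4
activated_experts = [0, 1, 5, 6]  # example routing decision for one token

# A GPU is "activated" only if it hosts at least one activated expert,
# so the token never needs to be sent to the other GPUs at all.
activated_gpus = {e // experts_per_gpu for e in activated_experts}
print(activated_gpus)  # {0, 1} -- GPUs 2 and 3 receive nothing
```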
Okay, so we're going to talk through how much of a slowdown there is, to what extent scale-up is a bottleneck over scale-out.
So we'll jump directly to the ratio of the time spent on scale-up over the time spent on scale-out.
So this is the quantity we're talking about.
And the first consideration is that scale-up is generally eight times faster than scale-out.
And so at a baseline, if the data volumes were the same, we would have this one over eight, which comes from the bandwidths.
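(In symbols, under the assumption above that scale-up bandwidth B_up ≈ 8 × B_out and that the same data volume D moves over each: t_up / t_out = (D / B_up) / (D / B_out) = B_out / B_up = 1/8.)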
But then we have some amount of expansion in how much data we're sending.
So if one token comes in here, then this one token gets routed to, in the DeepSeek case, maybe 32 experts or 16 experts, it gets routed to some number of experts.
So this is the number of activated experts.
And then the same thing applies across multiple layers.
So maybe I'm going to run two layers.
So we also multiply by the number of layers per stage.
Yes, yes.
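(Putting the pieces together: a small sketch of the ratio under this simple model. The parameter names and default values are hypothetical, using the illustrative numbers from the discussion: an 8x bandwidth gap, 16 activated experts, and 2 layers per stage.)

```python
def scaleup_over_scaleout(bandwidth_ratio: float = 8.0,
                          activated_experts: int = 16,
                          layers_per_stage: int = 2) -> float:
    """Time spent on scale-up relative to time spent on scale-out.

    bandwidth_ratio: how much faster scale-up links are than scale-out
        (roughly 8x, per the discussion above).
    activated_experts: experts each token is routed to, which expands
        the data volume moved over the scale-up links.
    layers_per_stage: the expansion repeats for every layer in a stage.
    """
    # Baseline 1/8 from bandwidths, multiplied by the data expansion.
    return (1.0 / bandwidth_ratio) * activated_experts * layers_per_stage

# With 16 activated experts and 2 layers per stage, scale-up takes
# 4x as long as scale-out under this model.
print(scaleup_over_scaleout())  # 4.0
```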