Reiner Pope
๐ค SpeakerAppearances Over Time
Podcast Appearances
There are other kinds of parallelism besides expert parallelism, which we just showed here.
In the literature is tensor parallelism.
With the trend towards smaller experts, this has become much less relevant, so we can ignore that.
But the other two things that we have available are data parallelism and pipeline parallelism.
And they can be a much better fit for using multiple racks.
So let's focus on pipeline parallelism specifically.
this is one layer of MOE.
I'm going to have like 100 more layers up above.
I could decide at this point, for example, to move to a different rack, a change rack.
Now, is that going to become a communication bottleneck?
So...
We can actually just solve for when this becomes a communication bottleneck.
But before we do that algebraically, let's just sort of visualize it out and sketch the path.
So we're going to have a bunch, this is another MOE layer, and we're going to have another MOE layer here and so on.
So let's say I change rack here, and then some number of layers later, I change rack here as well.
So our methodology that we're going to use to determine whether we have a communication bottleneck in this point where we change rack is we're going to compare the... This is the scale-out bandwidth requirements to the scale-up bandwidth requirements.
So let's try this.
I mean, the hint is going to be that...
There's a lot more transcends here.
We're sending many things here, whereas we're only sending one thing here.