Reiner Pope
๐ค SpeakerAppearances Over Time
Podcast Appearances
We have a divisibility problem.
This is not a power of two.
So we'll just simplify and say we're only going to use 64 of them.
Just ignore the other eight.
It's not a big deal.
And so we have four experts per GPU.
Very simple.
For the sake of the diagram, I'll actually just say, let's say we have two experts per GPU.
So we end up just putting, these are the GPU boundaries.
Every pair of experts is on its own GPU.
And then we can look at the communication cost.
We had some tokens stored centrally here.
They get routed to all of these experts.
And so there is some communication cost paid here.
There's the same communication cost paid on the output.
And then the hope is that this does not become communication limited.
Now, what is the traffic pattern here?
The traffic pattern here is that any GPU, in fact, will be talking to any other GPU, depending on the decisions made by the model.
So this is an all-to-all traffic pattern.
Yeah.