Reiner Pope
Yep.
So let's zoom in on the mixture of experts layer first and sort of draw what that looks like.
So we typically will have some kind of a router layer, which is making the decision of where we route the tokens to.
So we have tokens coming in here.
They go through a router layer, and then we have a bunch of different experts.
I'll draw a few more to line some up.
And then the router will make a decision, which experts am I going to route to?
And it'll be a small fraction of them, maybe one in 32.
So maybe it'll make a decision to route to this one, maybe this one, and maybe this one.
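To make that routing step concrete, here is a minimal NumPy sketch, assuming a learned router matrix, softmax gating, and a top-k of 8 (8 of 256 is the one-in-32 fraction mentioned above); the names and shapes are illustrative, not details from the discussion:

```python
import numpy as np

def route(tokens, router_weights, top_k=8):
    # tokens: [num_tokens, d_model]; router_weights: [d_model, num_experts]
    logits = tokens @ router_weights                    # score every expert for every token
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over the experts
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]     # keep only the top_k experts per token
    gates = np.take_along_axis(probs, chosen, axis=-1)  # their routing weights
    return chosen, gates
```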
So each of these experts itself is a normal MLP: it has an up projection, then a down projection, and a non-linearity in between. And then finally we sort of do the inverse operation: where we were broadcasting things out here, we're now going to bring them back in and sum them up, so bringing them in like this.
And then finally, we have our residual connections.
The token is also passed through here and it gets added to the result of the MoE layer.
So this is a normal MoE layer.
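Putting the pieces together, here is a minimal sketch of the whole layer as described: the router picks a few experts per token, each chosen expert applies its up projection, non-linearity, and down projection, the results are brought back in and summed with the router weights, and the token's residual is added back. The GeLU non-linearity and the shapes are assumptions for illustration:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def moe_layer(tokens, w_router, w_up, w_down, top_k=8):
    # tokens: [T, d_model]; w_router: [d_model, E]
    # w_up: [E, d_model, d_ff]; w_down: [E, d_ff, d_model]
    logits = tokens @ w_router
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]      # router picks a few experts per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for e in chosen[t]:
            hidden = gelu(tokens[t] @ w_up[e])            # up projection + non-linearity
            out[t] += probs[t, e] * (hidden @ w_down[e])  # down projection, brought back in and summed
    return tokens + out                                   # residual connection
```

In a real implementation the per-token loop would be replaced by batched gather and scatter operations over the experts, but the dataflow is the same.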
What I want to talk through is how this is mapped to a GPU rack and what this means for communication.
Because I think this will start to show some of the limits of how fast we can go.
So the standard practice here, and it is the best solution, is to use expert parallelism.
So that means different experts go on different GPUs.
So if we take something like a DeepSeek model, they have 256 experts.
Let's say we want to run that on a Blackwell rack.
So there are 72 GPUs.
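As a rough illustration of that mapping, 256 experts spread across 72 GPUs works out to 3 or 4 experts per GPU; a simple round-robin placement shows the arithmetic (the placement scheme itself is just an assumption for the sketch):

```python
NUM_EXPERTS = 256  # routed experts in a DeepSeek-style model
NUM_GPUS = 72      # GPUs in the rack

# Hypothetical round-robin placement: expert e lives on GPU e % 72.
expert_to_gpu = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}

experts_per_gpu = [list(expert_to_gpu.values()).count(g) for g in range(NUM_GPUS)]
print(min(experts_per_gpu), max(experts_per_gpu))  # -> 3 4: each GPU holds 3 or 4 experts
```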