Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yep.

So let's zoom in on the mixture of experts layer first and sort of draw what that looks like.

So we'll typically have some kind of router layer, which makes the decision of where to route the tokens.

So we have tokens coming in here.

They go through a router layer, and then we have a bunch of different experts.

I'll draw a few more to line some up.

And then the router will make a decision: which experts am I going to route to?

And it'll be a small fraction of them, maybe one in 32.

So maybe it'll make a decision to route to this one, maybe this one, and maybe this one.
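
To make the routing step concrete, here is a minimal NumPy sketch of what a router like this might compute. The sizes, the linear scoring layer, and the softmax over the selected experts are illustrative assumptions; the transcript only says that a small fraction, such as one in 32, gets chosen.

```python
import numpy as np

# Toy sizes, chosen so the example stays small. top_k / n_experts = 8/256,
# i.e. the "one in 32" fraction mentioned above.
d_model, n_experts, top_k = 64, 256, 8

rng = np.random.default_rng(0)
token = rng.standard_normal(d_model)            # one incoming token
W_router = 0.02 * rng.standard_normal((d_model, n_experts))

logits = token @ W_router                       # one score per expert
chosen = np.argsort(logits)[-top_k:]            # indices of the top-k experts
weights = np.exp(logits[chosen] - logits[chosen].max())
weights /= weights.sum()                        # softmax over the chosen experts
```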

These experts: each expert itself is a normal MLP. It has an up-projection and then a down-projection, with a non-linearity in between. And then finally we do the inverse operation: where we were broadcasting things out here, we're going to bring them back in and sum them up, bringing them in like this.
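
Continuing the sketch above, each expert is an ordinary MLP, and only the chosen experts run on this token. The ReLU non-linearity and the 4x up-projection width are assumptions here; many production MoEs use gated activations and other widths.

```python
d_ff = 4 * d_model   # up-projection width; 4x is a common but assumed choice

def expert_mlp(x, W_up, W_down):
    h = np.maximum(x @ W_up, 0.0)   # up projection, then the non-linearity
    return h @ W_down               # down projection

# One (W_up, W_down) pair per expert.
experts = [(0.02 * rng.standard_normal((d_model, d_ff)),
            0.02 * rng.standard_normal((d_ff, d_model)))
           for _ in range(n_experts)]

# "Bring them back in and sum them up": a weighted sum of the chosen outputs.
moe_out = sum(w * expert_mlp(token, *experts[e])
              for e, w in zip(chosen, weights))
```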

And then finally, we have our residual connections.

The token is also passed through here and it gets added to the result of the MoE layer.
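
In the sketch, that residual connection is a single add:

```python
# Residual connection: the token skips around the MoE layer and is added
# to the layer's output.
layer_out = token + moe_out
```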

So this is a normal MoE layer.

What I want to talk through is how this is mapped to a GPU rack and what this means for communication.

Because I think this will start to show some of the limits of how fast we can go.

So the standard practice here, and it is the best solution, is to use expert parallelism.

So that means different experts go on different GPUs.

So if we take something like a DeepSeek model, they have 256 experts.

Let's say we want to run that on a Blackwell rack.

So there are 72 GPUs.
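
As a quick check on that arithmetic, here is one way 256 experts might be laid out across the rack's 72 GPUs. The round-robin placement is purely an illustrative assumption; real systems may balance placement by measured load instead.

```python
import math

n_experts, n_gpus = 256, 72            # DeepSeek-style expert count, 72-GPU rack

print(n_experts / n_gpus)              # ~3.6 experts per GPU on average
print(math.ceil(n_experts / n_gpus))   # the fullest GPUs host 4 experts

# Assumed round-robin placement: GPUs 0-39 host 4 experts, GPUs 40-71 host 3.
placement = {e: e % n_gpus for e in range(n_experts)}
```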