Reiner Pope
Yep.
So let's zoom in on the mixture of experts layer first and sort of draw what that looks like.
So we typically will have some kind of a router layer, which is making the decision of where we route the tokens to.
So we have tokens coming in here.
They go through a router layer, and then we have a bunch of different experts.
I'll draw a few more to line some up.
And then the router will make a decision, which experts am I going to route to?
And it'll be a small fraction of them, maybe one in 32.
So maybe it'll make a decision to route to this one, maybe this one, and maybe this one.
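To make that routing step concrete, here is a minimal NumPy sketch, assuming a learned router matrix, softmax gating, and a top-k of 8 (8 of 256 is the one-in-32 fraction mentioned above); the names and shapes are illustrative, not details from the discussion:

```python
import numpy as np

def route(tokens, router_weights, top_k=8):
    # tokens: [num_tokens, d_model]; router_weights: [d_model, num_experts]
    logits = tokens @ router_weights                    # score every expert for every token
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over the experts
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]     # keep only the top_k experts per token
    gates = np.take_along_axis(probs, chosen, axis=-1)  # their routing weights
    return chosen, gates
```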
So each of these experts itself is a normal MLP: it has an up projection, then a down projection, and a non-linearity in between. And then finally we sort of do the inverse operation: where we were broadcasting things out here, we're now going to bring them back in and sum them up, so bringing them in like this.
And then finally, we have our residual connections.
The token is also passed through here and it gets added to the result of the MoE layer.
So this is a normal MoE layer.
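Putting the pieces together, here is a minimal sketch of the whole layer as described: the router picks a few experts per token, each chosen expert applies its up projection, non-linearity, and down projection, the results are brought back in and summed with the router weights, and the token's residual is added back. The GeLU non-linearity and the shapes are assumptions for illustration:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def moe_layer(tokens, w_router, w_up, w_down, top_k=8):
    # tokens: [T, d_model]; w_router: [d_model, E]
    # w_up: [E, d_model, d_ff]; w_down: [E, d_ff, d_model]
    logits = tokens @ w_router
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]      # router picks a few experts per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for e in chosen[t]:
            hidden = gelu(tokens[t] @ w_up[e])            # up projection + non-linearity
            out[t] += probs[t, e] * (hidden @ w_down[e])  # down projection, brought back in and summed
    return tokens + out                                   # residual connection
```

In a real implementation the per-token loop would be replaced by batched gather and scatter operations over the experts, but the dataflow is the same.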
What I want to talk through is how this is mapped to a GPU rack and what this means for communication.
Because I think this will start to show some of the limits of how fast we can go.
So the standard practice here, and it is the best solution, is to use expert parallelism.
So that means different experts go on different GPUs.
So if we take something like a DeepSeek model, they have 256 experts.
Let's say we want to run that on a Blackwell rack.
So there are 72 GPUs.
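As a rough illustration of that mapping, 256 experts spread across 72 GPUs works out to 3 or 4 experts per GPU; a simple round-robin placement shows the arithmetic (the placement scheme itself is just an assumption for the sketch):

```python
NUM_EXPERTS = 256  # routed experts in a DeepSeek-style model
NUM_GPUS = 72      # GPUs in the rack

# Hypothetical round-robin placement: expert e lives on GPU e % 72.
expert_to_gpu = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}

experts_per_gpu = [list(expert_to_gpu.values()).count(g) for g in range(NUM_GPUS)]
print(min(experts_per_gpu), max(experts_per_gpu))  # -> 3 4: each GPU holds 3 or 4 experts
```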