Reiner Pope
So if we had fewer micro-batches, we would have this idle time when we wrap around.
And you can sort of just visually see that the number of micro-batches is equal to the number of pipeline stages.
It's sort of proof by picture: there are four stages, and four micro-batches this way as well, and you can look and see that a micro-batch goes along here and then wraps around after the number of pipeline stages.
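The wrap-around claim can be checked with a toy simulation. This is a hedged sketch, not anything shown in the episode: it assumes a looped pipeline in which micro-batch j enters stage 0 at tick j, advances one stage per tick, and wraps from the last stage back to the first; `steady_state_utilization` is a made-up helper name.

```python
def steady_state_utilization(p: int, m: int, ticks: int = 1000) -> float:
    """Fraction of stage-ticks doing useful work in a looped pipeline.

    p: number of pipeline stages; m: number of micro-batches in flight (m <= p).
    Micro-batch j sits on stage (t - j) % p at tick t, so stage s is busy
    exactly when the slot index (t - s) % p belongs to one of the m live
    micro-batches.
    """
    busy = 0
    for t in range(ticks):
        for s in range(p):
            j = (t - s) % p  # which micro-batch slot this stage would hold now
            if j < m:        # slot actually occupied by a live micro-batch
                busy += 1
    return busy / (ticks * p)

# With as many micro-batches as stages, the loop stays full; with fewer,
# stages sit idle waiting for their micro-batch to come back around.
print(steady_state_utilization(4, 4))  # 1.0: no bubbles
print(steady_state_utilization(4, 2))  # 0.5: half the stage-ticks are idle
```

With m equal to p every stage is busy every tick, which is the "four and four" picture being described.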
For sure, this is done during massive-scale training.
It can be done for inference too, but I'm actually going to make the case for why it's less attractive there.
It is useful for the weights, but not so useful for the KVs.
The big challenge is, well, let's fill this in.
The number of micro-batches ends up being equal to the number of pipeline stages.
When we go back and substitute all of that into here,
we get the number of pipeline stages times this little b showing up in here.
Then, when we factor this out, I'm going to split this plus into two terms.
The weights term keeps the full division by e times p over here.
The activations term still has the division by e times p, but the p's cancel: this p and this p.
They canceled.
And so what we find is that as you increase the number of pipeline stages, the memory footprint for the weights keeps going down and down and down.
Of course.
But the memory footprint for the activations stays constant.
So it doesn't actually work.
Once you do enough pipelining, and it's really not much, even two stages is often enough, the weights term becomes very small, and most of your memory ends up being activations.
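The cancellation can be made concrete with a small numeric sketch. All symbols here are assumptions filled in from the discussion, not the actual formula on screen: W for total weight bytes, A for activation bytes per sample on a stage, b for micro-batch size, e for the other parallel axis being divided by, and p for pipeline stages, with p micro-batches in flight.

```python
def per_device_memory(W: float, A: float, b: int, e: int, p: int):
    """Per-device memory split into a weights term and an activations term.

    Assumed model: weights are sharded over e * p devices, while each device
    holds activations for all p in-flight micro-batches of size b, so the p
    in the numerator cancels the p in the denominator.
    """
    weights = W / (e * p)                # shrinks as p grows
    activations = (p * b) * A / (e * p)  # = b * A / e, independent of p
    return weights, activations

# Hypothetical numbers, just to show the trend.
for p in (1, 2, 4, 8):
    w, a = per_device_memory(W=100.0, A=1.0, b=8, e=1, p=p)
    print(p, w, a)  # weights halve each time p doubles; activations stay at 8.0
```

This is the whole argument in miniature: the weights column keeps falling, the activations column never moves, so past a small p the activations dominate.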