Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So if we had fewer micro-batches, we would have this idle time when we wrap around.

And so you can just visually see that the number of micro-batches is equal to the number of pipeline stages.

I mean, it's sort of a proof by picture: it is four, and it's four this way as well. You can see that a micro-batch goes along here and then wraps around after a number of steps equal to the number of pipeline stages.
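
To make the wrap-around picture concrete, here is a minimal sketch; my own illustration, not code from the talk. With p stages and m micro-batches cycling through the pipeline, each in-flight micro-batch occupies exactly one stage at any step, so p − m stages sit idle, and m = p is the smallest count with no idle time.

```python
# Hypothetical illustration (not from the talk): p pipeline stages,
# m micro-batches cycling around the pipeline. In steady state each
# in-flight micro-batch occupies exactly one stage, so p - m stages
# sit idle at every step.
def idle_stages(p: int, m: int) -> int:
    """Idle stages per step with m in-flight micro-batches on p stages."""
    return max(p - m, 0)

p = 4  # four stages, as in the example above
for m in range(1, p + 1):
    print(f"micro-batches={m}: idle stages per step = {idle_stages(p, m)}")
```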

For sure, this is done during massive-scale training.

It can be done for inference.

I'm actually going to make the case for why it is less attractive.

It is useful for weights, but not so useful for the KV cache.

Here's the big challenge, so let's fill this in.

The number of micro-batches here ends up being equal to the number of pipeline stages.
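
In symbols (my notation, not the talk's): with p pipeline stages and micro-batch size b, the little b below, the total batch in flight is

$$B = p \cdot b,$$

since keeping the pipeline full takes p micro-batches of size b.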

When we go back and substitute all of that into here, we get the number of pipeline stages times this little b showing up in here.

And then when we factor this out, I'm going to split this plus into two terms. For the weights term, we get the full division by e times p over here. For the activations term, we still have division by e times p over here, but the p's cancel, this p and this p. They cancel.
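
Written out, the substitution and split look like this; a sketch in my own notation, where W is total weight memory, a is activation (KV) memory per batch element, and memory is sharded across e times p chips, matching the "division by e times p" above:

$$\frac{W + B\,a}{e\,p} = \frac{W + (p\,b)\,a}{e\,p} = \frac{W}{e\,p} + \frac{b\,a}{e}.$$

The first term keeps the full division by e·p; in the second, the p's have cancelled.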

And so what we find is that as you increase the number of pipeline stages, the memory footprint for the weights keeps going down and down and down.

Of course.

But the memory footprint for the activations stays constant.

So it doesn't actually solve the memory problem.

Once you do enough pipelining, and it's really not much, even two stages is often enough, this weights term becomes very small, and most of your memory ends up being activations.
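
As a quick numeric check, here is a minimal sketch with made-up numbers (none of these values are from the episode) showing the weights term shrinking with p while the activations term stays constant:

```python
# Hypothetical numbers, not from the episode: 70 GB of weights, 2 GB of
# activation/KV memory per batch element, micro-batch size b, and e chips
# in the other sharding dimension. Per-chip memory follows the formula
# above: W/(e*p) + b*a/e.
W = 70.0  # total weight memory in GB (assumed)
a = 2.0   # activation/KV GB per batch element (assumed)
b = 1     # micro-batch size (assumed)
e = 8     # other sharding factor (assumed)

for p in (1, 2, 4, 8, 16):
    weights_per_chip = W / (e * p)  # keeps going down as p grows
    acts_per_chip = b * a / e       # the p's cancelled: constant in p
    print(f"p={p:2d}: weights={weights_per_chip:5.2f} GB/chip, "
          f"activations={acts_per_chip:.2f} GB/chip")
```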