
Reiner Pope

Speaker · 1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And we call that the KV cache.

So this process of a single token attending to all of the history of tokens, that's attention.
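
To make the mechanism concrete, here is a minimal NumPy sketch of one decode step: a single token's query attending over cached keys and values. The shapes and the function name are illustrative assumptions, not code from the talk.

```python
import numpy as np

def attend_one_token(q, k_cache, v_cache):
    """One decode step of attention: a single query against the cache.

    q:        (d,)   query vector for the new token
    k_cache:  (t, d) keys for the t previous tokens (the KV cache)
    v_cache:  (t, d) values for the t previous tokens
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)   # one dot product per past token
    scores -= scores.max()              # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()            # softmax over the whole history
    return weights @ v_cache            # weighted sum of cached values
```

After producing this token, its own key and value get appended to the cache, which is why the cache, and the memory traffic it implies, grows linearly with context length.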

It is mostly dominated by memory fetches rather than matrix multiplies.

So we've got the amount of memory that we're fetching shown over here.

And then that's, of course, just divided by the memory bandwidth.

So the memory bytes per second.
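
As a back-of-envelope check of that division, here is a small sketch with illustrative numbers; the model size and bandwidth are assumptions, not figures from the talk.

```python
params = 70e9                    # hypothetical 70B-parameter model
bytes_per_param = 2              # bf16 weights
weight_bytes = params * bytes_per_param

bandwidth = 3.35e12              # ~3.35 TB/s HBM, roughly H100-class

t_memory = weight_bytes / bandwidth
print(f"weight fetch alone: {t_memory * 1e3:.1f} ms per decode step")
# -> about 41.8 ms, i.e. ~24 tokens/s if nothing else dominated
```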

So in fact, these equations here are actually enough for us to now draw some fit lines.

And so the things that we'd like to look at are sensitivity to batch size, and then also, which we'll draw separately, sensitivity to context length.

So we said that the big effect you can get is a trade-off between latency and cost as you vary batch size.
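
One way to make that trade-off explicit, under the usual assumption that the weight fetch dominates at small batch: the step time barely moves as the batch grows, but it is shared across the B sequences in the batch, so

$$\text{latency per token} \approx T_\text{step}, \qquad \text{cost per token} \propto \frac{T_\text{step}}{B},$$

which means growing B cuts cost almost for free, until the compute or KV-fetch terms start to dominate the step time.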

So let's draw them out.

I think there are really just two graphs we want to draw.

We'll first just draw batch size versus time here.

So when we look at the shape of this, we've got a maximum of the sum and then another term.
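
Written out, the shape being described is the following roofline-style expression; the notation is mine, not the talk's:

$$T_\text{step} \;=\; \max\!\Big(\,T_\text{compute},\;\; \underbrace{\frac{\text{weight bytes} + \text{KV-cache bytes}}{\text{memory bandwidth}}}_{T_\text{memory}}\Big)$$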

So let's look at these terms one by one: how the compute and memory times scale with batch size, and how they show up on the graph.

So let's first look at this compute time.

This is just purely linear in batch size with no offset.

So it is some curve like this.

This is T compute.
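
Concretely, with the standard estimate of about two FLOPs per parameter per generated token, the compute term for a batch of B sequences is roughly (my notation, assuming dense matmuls dominate):

$$T_\text{compute} \;\approx\; \frac{2\,B\,N_\text{params}}{\text{peak FLOP/s}},$$

which is linear in B and passes through the origin, as drawn.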

And then on the memory side, we've got some portion here that is just a constant, a base offset, which is the weight fetch.
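
That base offset is just the model weights read once per step, independent of batch size; in the same notation:

$$T_\text{weights} \;=\; \frac{N_\text{params} \cdot \text{bytes per param}}{\text{memory bandwidth}}.$$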

And then finally, we have this term here, which is the KV fetch.
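
The KV fetch, by contrast, grows with both batch size and context length, since every sequence's cache has to be read on every step; in the same notation, $T_\text{KV} = B \cdot L \cdot (\text{KV bytes per token}) \,/\, \text{memory bandwidth}$. Putting the three terms together, here is a small sketch that traces the two graphs described above. Every constant in it is an illustrative assumption, not a number from the talk.

```python
# Illustrative roofline-style model of one decode step:
#   T_step = max(T_compute, T_weights + T_kv)
params = 70e9                # hypothetical 70B-parameter model
bytes_per_param = 2          # bf16
flops = 1e15                 # ~1 PFLOP/s peak (assumed)
bandwidth = 3.35e12          # ~3.35 TB/s HBM (assumed)
kv_bytes_per_token = 2.5e6   # assumed per-token KV-cache footprint

def t_step(batch, context):
    t_compute = 2 * batch * params / flops
    t_weights = params * bytes_per_param / bandwidth
    t_kv = batch * context * kv_bytes_per_token / bandwidth
    return max(t_compute, t_weights + t_kv)

# Graph 1: batch size versus time, at a fixed context length.
for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: {t_step(b, 4096) * 1e3:7.1f} ms/step")

# Graph 2: context length versus time, at a fixed batch size.
for ctx in (1024, 8192, 65536):
    print(f"context {ctx:6d}: {t_step(64, ctx) * 1e3:7.1f} ms/step")
```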