Reiner Pope

Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, I mean, it still crosses, but... Yeah, yeah, exactly.

Yeah, okay.

Right.

So there's two ways you can produce tokens, or the KV cache for a token.

You can just produce it from scratch by computing it from the underlying token IDs, which are tiny.

Or you can previously have produced it and stored it in a memory somewhere.

So the cost ratio is really talking about the ratio between those two mechanisms of producing it.

A cache miss means you've deleted it from all your memories and you have to recompute it from the tokens directly.
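
The two mechanisms can be sketched as a lookup that either returns a stored entry (a hit) or recomputes it from the token IDs (a miss). This is only an illustration: `PrefixKVCache`, `KVEntry`, and the `recompute` callback are hypothetical names, and a real KV cache holds large tensors rather than small tuples.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for a stored KV cache entry (really a large tensor).
KVEntry = Tuple[int, ...]


class PrefixKVCache:
    """Toy cache keyed by token IDs; counts hits and misses."""

    def __init__(self, recompute: Callable[[Tuple[int, ...]], KVEntry]):
        self._store: Dict[Tuple[int, ...], KVEntry] = {}
        self._recompute = recompute  # the expensive "from scratch" path
        self.hits = 0
        self.misses = 0

    def get(self, token_ids: Sequence[int]) -> KVEntry:
        key = tuple(token_ids)
        if key in self._store:
            # Hit: the entry was produced earlier and is still in some memory.
            self.hits += 1
            return self._store[key]
        # Miss: the entry is gone from every memory, so rebuild it from the tokens.
        self.misses += 1
        entry = self._recompute(key)
        self._store[key] = entry
        return entry

    def evict(self, token_ids: Sequence[int]) -> None:
        # Deleting the entry forces the next lookup to take the miss path.
        self._store.pop(tuple(token_ids), None)
```

The cost ratio in question is then the cost of the miss path divided by the cost of the hit path.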

In fact, you can maybe even take that a step further and think about which memory tier do you store it in.

So you could store it in HBM.

There are other slower and cheaper memories than HBM, like DDR on your host or Flash as well.

And so one of the things you can do is a calculation of where it makes sense to be in each memory tier.

And this is related to how long you're going to store it for.
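
One way to picture the tier calculation is as a comparison of cost lines against storage duration. The sketch below is a toy model, not the speaker's calculation: every number is a made-up placeholder in arbitrary cost units, and the per-tier `load_cost` (a penalty for reading back from a slower tier) is an added assumption. Only the shape of the comparison is meant to carry over.

```python
from typing import Dict, Tuple

# Hypothetical per-tier parameters: (one-time load cost per token,
# storage cost per byte per second). All values are placeholders.
TIERS: Dict[str, Tuple[float, float]] = {
    "hbm": (0.0, 1e-12),
    "ddr": (1e-7, 1e-13),
    "flash": (1e-6, 1e-14),
}


def cheapest_option(
    storage_seconds: float,
    bytes_per_token: float,
    remat_cost_per_token: float,
    tiers: Dict[str, Tuple[float, float]] = TIERS,
) -> Tuple[str, Dict[str, float]]:
    """Return the cheapest way to have one token of KV cache available later."""
    costs = {
        name: load_cost + rate * bytes_per_token * storage_seconds
        for name, (load_cost, rate) in tiers.items()
    }
    # Rematerializing costs nothing to store, only a fixed recompute cost.
    costs["remat"] = remat_cost_per_token
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    for seconds in (0.5, 10.0, 1_000.0, 1_000_000.0):
        best, _ = cheapest_option(seconds, bytes_per_token=100_000,
                                  remat_cost_per_token=1e-4)
        print(f"{seconds:>12.1f} s -> {best}")
```

With these placeholder numbers, short retention favors the fast tier, longer retention moves to slower and cheaper tiers, and past some duration it is cheaper to delete the entry and rematerialize it.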

So we want to look at the cost of storage in a few different memory tiers and also the cost of rematerialization.

So remat means the cost to rebuild all of the KV cache from scratch after you've deleted it.

So we rematerialize it.

And so basically, this cost is going to scale with the length of the context.

Actually, we'll look at cost per token so that we don't need to carry around this length of context everywhere.

So to rematerialize one token of KV cache, I just need to run a forward pass on the whole model.

So this is going to be the compute time.
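
A rough way to put a number on that compute time is the common approximation that a forward pass costs about 2 FLOPs per active parameter per token, ignoring attention FLOPs. The function below is a sketch under that assumption; the function name, the parameter names, and the `utilization` factor are illustrative and do not come from the talk.

```python
def remat_seconds_per_token(
    active_params: float,
    peak_flops_per_second: float,
    utilization: float = 0.5,
) -> float:
    """Estimate the compute time to rematerialize one token of KV cache."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Assumption: forward pass is roughly 2 FLOPs per active parameter per token.
    flops_per_token = 2.0 * active_params
    # Assumption: the hardware sustains only a fraction of its peak throughput.
    return flops_per_token / (peak_flops_per_second * utilization)
```

Multiplying the result by the context length gives the time to rebuild a whole context, which is why working per token avoids carrying that length through every expression.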