Reiner Pope

Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

Yeah, I mean, it still crosses, but... Yeah, yeah, exactly.

Yeah, okay.

Right.

So there's two ways you can produce tokens, or the KV cache for a token.

You can just produce it from scratch by computing it from the underlying token IDs, which are tiny.

Or you can have produced it previously and stored it in memory somewhere.

So the cost ratio is really talking about the ratio between those two mechanisms of producing it.

A cache miss means you've deleted it from all your memories and you have to recompute it from the tokens directly.

In fact, you can maybe even take that a step further and think about which memory tier do you store it in.

So you could store it in HBM.

There are other memories that are slower and cheaper than HBM, like DDR on your host, or flash as well.

And so one of the things you can do is a calculation of where it makes sense to be in each memory tier.

And this is related to how long you're going to store it for.

So we want to look at the cost of storage in a few different memory tiers and also the cost of rematerialization.
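As a rough sketch of the comparison being set up here: storing a token's KV cache costs money in proportion to how long it is held, while rematerializing it is a one-time cost, so each tier has a break-even hold time. The `Tier` class, the function names, and every price and size below are hypothetical placeholders, not figures from the episode, and the sketch ignores the cost of reading the cache back from a slower tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    dollars_per_gb_hour: float  # hypothetical price, not from the episode


def storage_cost_per_token(tier: Tier, kv_bytes_per_token: float, hours: float) -> float:
    """Dollars to keep one token's KV cache in `tier` for `hours`."""
    return (kv_bytes_per_token / 1e9) * tier.dollars_per_gb_hour * hours


def break_even_hours(tier: Tier, kv_bytes_per_token: float,
                     remat_dollars_per_token: float) -> float:
    """Hold time beyond which storing costs more than recomputing once."""
    hourly_rate = (kv_bytes_per_token / 1e9) * tier.dollars_per_gb_hour
    return remat_dollars_per_token / hourly_rate


# Placeholder numbers purely for illustration.
tiers = [Tier("HBM", 1.0), Tier("DDR", 0.1), Tier("flash", 0.01)]
for t in tiers:
    hours = break_even_hours(t, kv_bytes_per_token=1e5, remat_dollars_per_token=1e-7)
    print(f"{t.name}: storing beats recomputing for up to {hours:.3g} hours")
```

Under these assumptions a cheaper tier always has a longer break-even time, which is why the expected hold time drives the choice of tier.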

So remat means the cost to rebuild all of the KV cache from scratch after you've deleted it.

So we rematerialize it.

And so basically, this is going to cost the length of the context.

Actually, we'll look at cost per token so that we don't need to carry around this length of context everywhere.

So to rematerialize one token of KV cache, I just need to run a forward pass on the whole model.

So this is going to be the compute time.
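One way to put a number on that compute time is sketched below. It assumes the common approximation that a forward pass costs about 2 FLOPs per active parameter per token, which is not stated in this excerpt, and it ignores the attention FLOPs that grow with context length. The function name and all example values are hypothetical.

```python
def remat_seconds_per_token(active_params: float,
                            chip_flops_per_sec: float,
                            utilization: float = 1.0) -> float:
    """Approximate compute time to rebuild one token's KV cache.

    Assumes ~2 FLOPs per active parameter per token (one multiply and
    one add), a standard rule of thumb rather than a figure from the episode.
    """
    flops_per_token = 2.0 * active_params
    return flops_per_token / (chip_flops_per_sec * utilization)


# Hypothetical example: 100e9 active parameters on a 1e15 FLOP/s chip
# running at 50% utilization.
print(remat_seconds_per_token(100e9, 1e15, utilization=0.5))
```

Multiplying this time by the price of a chip-second gives a per-token rematerialization cost that can be compared against the per-token storage cost in each memory tier.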
