Reiner Pope
Yeah, I mean, it still crosses, but... Yeah, yeah, exactly.
Yeah, okay.
Right.
So there's two ways you can produce tokens, or the KV cache for a token.
You can just produce it from scratch by computing it from the underlying token IDs, which are tiny.
Or you can have produced it previously and stored it in memory somewhere.
So the cost ratio is really talking about the ratio between those two mechanisms of producing it.
A cache miss means you've deleted it from all your memories and you have to recompute it from the tokens directly.
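To make "tiny" concrete, here is a rough size comparison between a token ID and the KV cache entry for that token. The model shape below is hypothetical and picked only for illustration, and the formula assumes a standard attention layout that stores one key vector and one value vector per KV head in every layer; none of these numbers come from the conversation.

```python
# Hypothetical model shape (assumption, not from the conversation).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # 16-bit KV entries (assumption)
TOKEN_ID_BYTES = 4    # one 32-bit integer per token (assumption)


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: a key and a value vector
    per KV head, in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


kv_bytes = kv_cache_bytes_per_token(
    NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE
)
size_ratio = kv_bytes // TOKEN_ID_BYTES

print(f"KV cache per token: {kv_bytes} bytes")
print(f"Token ID:           {TOKEN_ID_BYTES} bytes")
print(f"Size ratio:         {size_ratio}x")
```

Under these assumptions the stored form is tens of thousands of times larger than the token ID it was computed from, which is why keeping only the token IDs and recomputing on a miss is a real option.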
In fact, you can maybe even take that a step further and think about which memory tier you store it in.
So you could store it in HBM.
There are other memories that are slower and cheaper than HBM, like DDR on your host, or flash.
And so one of the things you can do is a calculation of where it makes sense to be in each memory tier.
And this is related to how long you're going to store it for.
So we want to look at the cost of storage in a few different memory tiers and also the cost of rematerialization.
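A minimal sketch of that comparison, under loud assumptions: every cost below is a made-up number in arbitrary units per token, and the model adds a one-time read-back cost for the slower tiers, which is an assumption of this sketch and not something stated in the conversation. The point is only the shape of the calculation, where the best choice changes with how long the cache is held.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    store_cost_per_second: float  # hypothetical cost to hold one token's KV
    load_cost: float              # hypothetical one-time cost to read it back


# All values are illustrative placeholders, not measurements.
TIERS = [
    Tier("HBM", store_cost_per_second=1.0, load_cost=0.0),
    Tier("DDR", store_cost_per_second=0.1, load_cost=5.0),
    Tier("flash", store_cost_per_second=0.01, load_cost=20.0),
]
REMAT_COST = 100.0  # hypothetical cost to recompute one token's KV


def cheapest_option(hold_seconds, tiers=TIERS, remat_cost=REMAT_COST):
    """Return (name, cost) of the cheapest way to have the KV cache
    available again after hold_seconds."""
    best_name, best_cost = "rematerialize", remat_cost
    for tier in tiers:
        cost = tier.store_cost_per_second * hold_seconds + tier.load_cost
        if cost < best_cost:
            best_name, best_cost = tier.name, cost
    return best_name, best_cost


for seconds in (1, 10, 1_000, 100_000):
    name, cost = cheapest_option(seconds)
    print(f"hold {seconds:>7} s -> {name}")
```

With these placeholder numbers, short holds favor HBM, longer ones move to DDR and then flash, and past some duration it is cheaper to delete the cache and rematerialize it.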
So remat means the cost to rebuild all of the KV cache from scratch after you've deleted it.
So we rematerialize it.
And so basically, this cost is going to scale with the length of the context.
Actually, we'll look at cost per token so that we don't need to carry around this length of context everywhere.
So to rematerialize one token of KV cache, I just need to run a forward pass on the whole model.
So this is going to be the compute time.
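As a rough sketch of that compute time, one common approximation is that a forward pass costs about 2 FLOPs per model parameter per token (one multiply and one add), ignoring the attention term. That approximation, the parameter count, the chip throughput, and the utilization below are all assumptions for illustration and are not figures from the conversation.

```python
def remat_seconds_per_token(num_params, peak_flops_per_second, utilization):
    """Approximate compute time to rebuild one token of KV cache.

    Assumes a forward pass costs ~2 FLOPs per parameter per token and
    ignores attention FLOPs, which grow with context length.
    """
    flops_per_token = 2 * num_params
    return flops_per_token / (peak_flops_per_second * utilization)


# Hypothetical values, chosen only to show the arithmetic.
NUM_PARAMS = 70e9            # parameters used per token
PEAK_FLOPS_PER_SECOND = 1e15 # accelerator peak throughput
UTILIZATION = 0.5            # fraction of peak actually achieved

seconds = remat_seconds_per_token(NUM_PARAMS, PEAK_FLOPS_PER_SECOND, UTILIZATION)
print(f"remat time per token: {seconds * 1e6:.0f} microseconds")
```

Multiplying this per-token time by a price per accelerator-second gives a rematerialization cost per token in the same units as a storage cost per token.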