Reiner Pope
I have to do a certain number of multiplies, a certain amount of GPU time that I spend, in order to produce it.
Storing in HBM.
This really goes as the bytes per token; I think I had a number for that here.
So I need to have some number of bytes per token, and then I need to store this in the HBM.
So it's gonna use up some of my HBM capacity.
So a way to think of this is that if I have too many of these things sitting in my HBM, if I fill up my HBM with just KV caches that I'm not using, I can't use that GPU.
And so how do I price that?
Maybe I say that the cost of it is proportional to the fraction of the HBM I'm using.
So it's that fraction times GPU dollars.
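As a rough sketch of that pricing rule: the holding cost of a KV cache is the fraction of HBM it occupies times what the GPU costs per second. The function name, parameter names, and the numbers in the usage note are all hypothetical, chosen only to illustrate the arithmetic.

```python
def hbm_hold_cost_per_second(
    bytes_per_token: float,
    num_tokens: int,
    hbm_capacity_bytes: float,
    gpu_dollars_per_second: float,
) -> float:
    """Price an idle KV cache as (fraction of HBM occupied) * (GPU price).

    All parameters are illustrative assumptions, not measured values.
    """
    kv_cache_bytes = bytes_per_token * num_tokens
    hbm_fraction = kv_cache_bytes / hbm_capacity_bytes
    return hbm_fraction * gpu_dollars_per_second
```

For example, a cache that fills a quarter of the HBM would be charged a quarter of the GPU's per-second price under this model, for as long as it sits there.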
And then let's just do one more memory tier and say something like DDR, store in DDR instead.
The same kind of thing goes for Flash and for DDR.
I put these in the wrong columns, actually.
I meant to make two columns.
The distinction I want to make is that there is the cost to retrieve.
And then there's a cost to store, a cost to hold on to it.
And so holding on is a cost per second, whereas retrieving is an instantaneous cost.
So rematerialization has a cost to retrieve and has zero cost to store it because we've deleted it.
This is the one that I put in the wrong location.
This is actually the cost just to hold on.
So I will rewrite it.
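The two columns can be sketched as a small cost model: each option has a one-off cost to retrieve and a per-second cost to hold on, and the total depends on how long the cache sits idle. The class name, helper name, and every dollar figure used with them are hypothetical. Treating rematerialization as having zero hold cost follows the discussion above; any retrieve cost assigned to a memory tier is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierCost:
    """Two-column cost of one option (illustrative, not measured)."""
    retrieve_dollars: float          # instantaneous cost, paid once on reuse
    hold_dollars_per_second: float   # ongoing cost while the cache sits idle

    def total(self, idle_seconds: float) -> float:
        # One-off retrieve cost plus holding cost accrued over the idle time.
        return self.retrieve_dollars + self.hold_dollars_per_second * idle_seconds


def cheapest(options: dict[str, TierCost], idle_seconds: float) -> str:
    """Return the name of the option with the lowest total cost."""
    return min(options, key=lambda name: options[name].total(idle_seconds))
```

With made-up numbers, an option that is free to retrieve but costly to hold wins for short idle times, and rematerialization (costly to retrieve, free to hold) wins once the cache has been idle long enough.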