Reiner Pope
So the high-level point, first of all, is that there is some amount of increasing cost with context length.
And we can bring that back up.
That was the memory time versus the compute time.
So we've put up these same equations from before: the time for the memory fetches, which is the weights and the KV cache, and the time for the compute, which is just the matrix multiplications for the weights.
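As a rough sketch of those two terms (with B the batch size, P the parameter count, S the context length, and c_KV the KV-cache bytes per token; these symbols are introduced here for illustration, not taken from the conversation):

$$
T_{\text{memory}} \approx \frac{\text{bytes}_{\text{weights}} + B \cdot S \cdot c_{\text{KV}}}{\text{HBM bandwidth}},
\qquad
T_{\text{compute}} \approx \frac{2\,B\,P}{\text{FLOPs per second}}
$$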
I will also draw the cost curve.
But this time I'll do it as a function of context length instead of as a function of batch size.
So the vertical axis here is just time.
So this is the cost curve as a function of context length.
We'll draw the compute.
The cost of the compute is actually constant as a function of context length.
There's no dependence here on context length.
In reality, there is some dependence, but it is a very mild dependence, so we'll ignore it.
So this line is the time for the compute.
And then we'll also draw the dependence of the memory fetch on context length.
And this starts at a large number for the weights and then grows gradually with the context length.
And so you take the maximum and you see there is this inflection point here.
So this is the cost that, for example, Gemini might be paying.
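As a rough numerical sketch of that picture, here is a small calculation with entirely illustrative hardware and model numbers (a hypothetical 70B-parameter model in bf16, roughly 3 TB/s of HBM bandwidth, roughly 1 PFLOP/s, and an assumed per-token KV-cache footprint); none of these figures come from the conversation. It just evaluates the two terms above and reports which one dominates at each context length:

```python
# Illustrative numbers only.
WEIGHT_BYTES = 70e9 * 2       # hypothetical 70B-parameter model, 2 bytes per weight (bf16)
KV_BYTES_PER_TOKEN = 40e3     # assumed KV-cache footprint per token
HBM_BW = 3.0e12               # assumed HBM bandwidth, bytes per second
FLOPS = 1.0e15                # assumed accelerator throughput, FLOPs per second
BATCH = 512                   # chosen large enough that compute dominates at short context

# Compute time: ~2 FLOPs per parameter per token, constant in context length.
compute_time = 2 * BATCH * 70e9 / FLOPS

for context_len in [1_000, 4_000, 16_000, 64_000, 256_000, 1_000_000]:
    # Memory time: stream the weights plus the whole KV cache once per decode step.
    memory_time = (WEIGHT_BYTES + BATCH * context_len * KV_BYTES_PER_TOKEN) / HBM_BW
    # The step time is set by whichever term is larger; the inflection point is the
    # context length where the growing memory term overtakes the flat compute term.
    bound = "memory-bound" if memory_time > compute_time else "compute-bound"
    print(f"{context_len:>9} tokens: {max(memory_time, compute_time) * 1e3:7.1f} ms  ({bound})")
```

With these assumed numbers the crossover lands in the low thousands of tokens; where the inflection point actually sits depends entirely on the hardware ratios and the KV-cache layout.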