Reiner Pope
I have to rerun the compute at whatever speed my GPU does it, and then I multiply that by my GPU dollars per second.
Yeah, so there is a quadratic term.
It shows up in the compute.
As an approximation, I chose to remove it.
I'll just show you sort of quickly what that looks like.
It's because... So you have the...
If you look at the cost per token, or the number of flops per token, as a function of context length, there are the flops that come from doing the weight matrix multiplies.
And then there is the number of multiplies that comes from attending over the KV cache, which goes up linearly with the amount of stuff you attend to.
The slope on this is so low that when you draw it like this, it's very well approximated by a flat line.
So you start to notice the effect of the quadratic or the linear term up in the millions of tokens or so.
So just not super relevant.
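The two per-token terms described above can be written down as a rough sketch. The function name, parameter names, and constant factors here are assumptions for illustration, not something stated in the conversation: it assumes 2 flops per active weight for the matrix multiplies, and 4 * d_attn flops per layer per attended position for attention (2 for the query-key dot product, 2 for the weighted sum of values).

```python
def flops_per_token(n_active_params, n_layers, d_attn, context_len):
    """Rough forward-pass flops to produce one token (illustrative only).

    n_active_params: weights touched per token (hypothetical parameter)
    d_attn: number of query heads times head dimension (hypothetical parameter)
    context_len: number of cached positions the new token attends to
    """
    # Weight matrix multiplies: one multiply and one add per weight.
    # This term does not depend on context length (the "flat line").
    weight_flops = 2 * n_active_params

    # Attention over the KV cache: grows linearly with context length.
    attention_flops = 4 * n_layers * d_attn * context_len

    return weight_flops, attention_flops
```

Summing the attention term over every token in a sequence gives the quadratic term in total compute. Where the linear term becomes visible next to the flat one depends entirely on the model configuration passed in.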
Yeah, so there are two costs of long context.
One is the memory bandwidth cost, which we've spent a lot of time analyzing.
That's this thing.
And then the other one is the compute cost.
The compute cost is almost always, and sort of actually forced by fundamental principles, a much smaller slope than the memory bandwidth cost.
And so the primary things that limit you from having really large contexts are memory bandwidth and memory capacity, which is exactly this effect.
Yeah.
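One way to read the comparison of the two slopes is as a back-of-the-envelope calculation: for each extra cached token, how much time does one layer spend on arithmetic versus on reading that token's keys and values from memory during decode? The sketch below is a minimal illustration under stated assumptions. The function name, the head counts, and the hardware figures are all hypothetical placeholders, not numbers from the conversation.

```python
def attention_slopes(n_q_heads, n_kv_heads, d_head, bytes_per_value,
                     peak_flops_per_s, mem_bytes_per_s):
    """Seconds added per layer, per sequence, for each extra cached token.

    Returns (compute_seconds, memory_seconds). Illustrative model only.
    """
    # Arithmetic: query-key dot product plus weighted sum of values,
    # 2 flops each per element, for every query head.
    flops = 4 * n_q_heads * d_head

    # Memory traffic: one key vector and one value vector per KV head.
    kv_bytes = 2 * n_kv_heads * d_head * bytes_per_value

    compute_s = flops / peak_flops_per_s
    memory_s = kv_bytes / mem_bytes_per_s
    return compute_s, memory_s


# Hypothetical configuration: 64 query heads sharing 8 KV heads,
# 2-byte values, and an accelerator with 1e15 flops/s and 3e12 bytes/s.
compute_s, memory_s = attention_slopes(
    n_q_heads=64, n_kv_heads=8, d_head=128, bytes_per_value=2,
    peak_flops_per_s=1e15, mem_bytes_per_s=3e12,
)
```

With these placeholder numbers the memory slope comes out roughly 40 times steeper than the compute slope. The underlying reason, under this model, is that each byte of KV cache is used for only a handful of flops, and because the cache belongs to a single sequence, batching more requests does not amortize the read the way it does for weights.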