Reiner Pope
I mean, sparse attention gives you an out for sure, because you get this square root; it gives you a big improvement.
But I think if you look at the history of context lengths of models, from earlier models like GPT-3, maybe to GPT-4 (I don't remember exactly when the transition happened), they shot up from about 8K to 100K, 200K.
And then for the last year or two, they've all been hovering around there.
I think that actually indicates that that's sort of the reasonably balanced cost point.
And going massively beyond that would be cost prohibitive.
Because of the memory bandwidth cost, yeah.
So I actually don't see a very good path to solving that.
HBM is where it is.
It's not getting hugely better.
Sparse attention is a big improvement.
Maybe that is priced in already, perhaps.
It's not an infinite improvement because if you go too sparse, you lose too much quality.
But yeah, I mean, the empirical result is that context lengths haven't been increasing that much.
And I think it's because there is no solution to the memory wall here.
Interesting.
So going too sparse just means you're attending to a very small subset of the tokens and the quality will get worse.
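[Editor's sketch, not part of the conversation. To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope calculation of how many KV-cache bytes must be read to decode one token, for dense attention versus a sparse scheme that attends to about sqrt(n) tokens. The model shape, the 2-byte value size, the bandwidth figure, and the reading of "square root" as "attend to roughly sqrt(n) tokens" are all illustrative assumptions, not numbers from the speakers.]

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes stored per context token (keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_read_bytes(context_len, per_token_bytes, sparse=False):
    """Bytes of KV cache read to decode one new token.

    Dense attention reads every cached token. The sparse case assumes
    (hypothetically) that only about sqrt(context_len) tokens are attended to.
    """
    attended = math.isqrt(context_len) if sparse else context_len
    return attended * per_token_bytes


def read_time_ms(n_bytes, bandwidth_bytes_per_s):
    """Time to stream n_bytes from memory at the given bandwidth, in ms."""
    return 1e3 * n_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape and memory bandwidth, chosen only for illustration.
    per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
    bandwidth = 3e12  # assumed bytes per second

    for n in (8_192, 131_072, 1_048_576):
        dense = kv_read_bytes(n, per_token)
        sparse = kv_read_bytes(n, per_token, sparse=True)
        print(
            f"context={n:>9,}  "
            f"dense={read_time_ms(dense, bandwidth):8.3f} ms  "
            f"sqrt-sparse={read_time_ms(sparse, bandwidth):8.5f} ms"
        )
```

[Under these assumptions the dense read cost grows linearly with context length, which is the memory wall described above, while the sparse cost grows only with the square root. The sketch says nothing about the quality lost as attention gets sparser.]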
So what is the cost of these different ways of producing, resynthesizing the KV cache?
Computing it from scratch is basically GPU time.