Dwarkesh Patel
The compute time?
But suppose it's like...
This is a very simple algebra problem, but suppose the optimal is 100k context length.
And you go to 200k context length.
Does your MFU go down to like 50%?
Does it have a humongous impact on MFU?
Yeah, it does.
To be slightly outside the optimal context-length range, the Goldilocks zone.
Got it.
And is sparse attention what everybody uses in practice?
So Claude Code slow or Codex slow or whatever would just live on this line, and it wouldn't help much because you're not able to amortize the KV values over a much bigger batch.
So this point where you are no longer memory-bandwidth bound: how big a batch do you need?
How big are the batches practically for frontier models?
Sorry, has that ratio changed over time as we've gone from model generation to model generation, where the FLOPs keep increasing?
Okay, so basically it's like 2,000 to 3,000 tokens per batch.
But then if you included the KV cache, the implication would be that the optimal batch size should grow larger.
This seems incredibly small.
Like a batch, this would be like less than one sequence, right?
A single forward pass, yeah, on these sequences. Do you think of the batch as the number of sequences rather than... That's right. Okay, cool. Yeah.
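The batch-size questions above (how big a batch you need before you stop being memory-bandwidth bound, and why counting KV cache reads pushes that batch larger) can be illustrated with a rough roofline-style sketch. Everything here is an assumption for illustration, not something stated in the conversation: the function name `balance_batch`, the dense-model cost of 2 FLOPs per parameter per token, one token decoded per sequence per forward pass, and all hardware and model numbers in the usage lines.

```python
def balance_batch(flops_per_s, bytes_per_s, n_params,
                  bytes_per_param=2.0, kv_bytes_per_seq=0.0):
    """Sequences per decode step at which compute time equals memory time.

    Assumed model (hypothetical, dense weights, one token per sequence):
      compute time = 2 * n_params * batch / flops_per_s
      memory time  = (n_params * bytes_per_param
                      + batch * kv_bytes_per_seq) / bytes_per_s
    """
    compute_per_seq = 2.0 * n_params / flops_per_s            # math time per sequence
    kv_per_seq = kv_bytes_per_seq / bytes_per_s               # KV read time per sequence
    weight_time = n_params * bytes_per_param / bytes_per_s    # stream the weights once
    if compute_per_seq <= kv_per_seq:
        # KV reads alone cost at least as much as the math:
        # no batch size makes the step compute bound.
        return float("inf")
    return weight_time / (compute_per_seq - kv_per_seq)


# Illustrative numbers only (not from the conversation).
no_kv = balance_batch(1e15, 2e12, 1e11)
with_kv = balance_batch(1e15, 2e12, 1e11, kv_bytes_per_seq=2e8)
print(no_kv, with_kv)
```

Under these assumptions, weight reads are amortized over the whole batch while KV reads scale with it, so any nonzero KV cost per sequence raises the break-even batch, and a long enough context means the step never becomes compute bound.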