Dwarkesh Patel
Podcast Appearances
Oh, sorry, extremely naive question.
Why is there not a quadratic term?
So what is the reason that there's no company which has over a million token context length, if this is true?
And so there's this idea that Dario said on the podcast, and others have said, which is that we don't need continual learning for AGI; in-context learning is enough.
And if you believe that, then you have to think that we'd have to get to a 100 million token, maybe even a billion token, context length to have an employee that is the equivalent of someone who has been working with you for a month.
Now, maybe that's no longer true with sparse attention or something.
But yeah, if you think that, then something in ML inference would have to change, like the memory bandwidth, to allow for 100 million token context lengths.
Not because of the compute cost, but because of the memory bandwidth cost.
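To make the memory bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes dense attention, where generating each new token means reading the whole KV cache out of HBM once, and it takes the read time to be bytes divided by memory bandwidth. The model shape (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the 3 TB/s bandwidth figure are hypothetical placeholders, not numbers from the conversation.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * context_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


def read_time_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to stream num_bytes out of memory once: bytes / bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model shape and hardware numbers, chosen only for illustration.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
HBM_BANDWIDTH = 3e12  # bytes per second (assumed)

for context in (1_000_000, 100_000_000):
    size = kv_cache_bytes(context, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    seconds = read_time_seconds(size, HBM_BANDWIDTH)
    print(f"{context:>11,} tokens: {size / 1e12:.2f} TB of KV cache, "
          f"{seconds:.2f} s to read once per generated token")
```

Under these assumptions the cache grows linearly with context length, so going from 1 million to 100 million tokens multiplies both the bytes and the per-token read time by 100, independent of how much compute is available.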
And why doesn't sparse attention solve it?
Why isn't the cost to retrieve from HBM just the bytes divided by the memory bandwidth?
Because if it's already in HBM, you can be doing compute while you're getting it from HBM to HBM?
Yeah, for example.
Okay.
And the price difference, I think, was... I'll look it up.
Okay, so the base input tokens are $5 per million tokens.
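As a quick sanity check on that price, the arithmetic can be sketched as below. The $5 per million input tokens figure is the one quoted here; the function name and the example prompt sizes are illustrative assumptions.

```python
def input_cost_usd(num_tokens, usd_per_million_tokens=5.0):
    """Cost of sending num_tokens input tokens at a flat per-million price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


# Illustrative prompt sizes: a 1 million and a 100 million token context.
for tokens in (1_000_000, 100_000_000):
    print(f"{tokens:>11,} input tokens cost ${input_cost_usd(tokens):,.2f}")
```

This assumes a flat rate with no caching discount or long-context surcharge, which real pricing may not follow.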
Which means remap.
Yeah, that's five.