Dwarkesh Patel
Whatever.
Okay, memory.
So you're not storing the KV cache for the prefill tokens.
In fact, this is like you read a file,
Yeah, okay.
Okay, so suppose we're here.
So, you will need to load... Basically, you will have calculated all of this previously.
So, just the KV of everything that came before.
But what is the memory cost of this?
Well... memory bandwidth cost of this.
If you're doing FlashAttention, it would... Yeah, it's basically temporary.
It doesn't even go to main memory.
Just ignore it.
Okay, so then it would just be everything that came before.
So, is it not just that then?
Okay, great.
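As a rough sketch of the point just made: the memory traffic for one decode step is dominated by reading the cached keys and values of every earlier token, while the attention intermediates are treated as temporary and ignored. The function name, the parameters (layer count, KV head count, head dimension, bytes per element), and the example sizes below are all illustrative assumptions, not figures from the conversation.

```python
def kv_bytes_read_per_decode_step(
    context_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # assumption: 16-bit KV cache entries
) -> int:
    """Approximate bytes of KV cache read from main memory to decode one token.

    Counts only the cached keys and values of the tokens that came before.
    Attention scores are assumed to stay on-chip (as with a fused attention
    kernel), so they contribute nothing here.
    """
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (
        keys_and_values
        * context_len
        * n_layers
        * n_kv_heads
        * head_dim
        * bytes_per_elem
    )


# Hypothetical toy model: the cost grows linearly with the context length.
short_ctx = kv_bytes_read_per_decode_step(1000, n_layers=2, n_kv_heads=4, head_dim=8)
long_ctx = kv_bytes_read_per_decode_step(2000, n_layers=2, n_kv_heads=4, head_dim=8)
```

Under these assumptions, doubling the number of preceding tokens doubles the bytes read per generated token, which is why the term is "just everything that came before."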
Oh, so it's a very trivial change to accommodate.
So, this term is making it 5x more expensive.
Now, why would that be?