Reiner Pope
Podcast Appearances
That gives you exactly this number.
Or you could have, like, fewer kv-heads but more layers.
So this is one way to get there via dense attention.
There's also a way to get there via sparse attention, where you increase all of these numbers but then you have a sparsity term.
So yeah, I mean, I think this number is plausible if maybe a little bit small.
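As a rough sketch of the dense-attention trade-off mentioned above, assuming the number under discussion is the per-token KV-cache size: with the standard dense formula, halving the KV heads while doubling the layers leaves the size unchanged. The function name and the layer, head, and dimension values below are hypothetical and are not taken from the conversation.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Per-token KV-cache size for dense attention.

    The leading factor of 2 counts one key vector and one value vector
    per KV head per layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical configurations: same cache size, different shape.
wide = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
deep = kv_bytes_per_token(layers=128, kv_heads=4, head_dim=128)

print(wide, deep)
```

Both configurations come out to the same number of bytes per token, which is the sense in which fewer KV heads can be traded for more layers.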
I mean, you are incentivized to price close to your costs, because otherwise someone could undercut you.
Yeah.
I don't remember.
What I've seen in the past is like three or five times more expensive.
That makes more sense.
We can think of decode as being a pass with one token, and pre-fill as being a pass with many tokens.
Let's actually draw how pre-fill shows up here.
If I may clarify, so we do a bit of decode like this.
We may actually come back and do more pre-fill.
If you think of this as a chat session: the user says something, the AI generates a response, and then the user says something else, and we pre-fill that.
So maybe this is the more common one; this is the general case rather than this.
Reading a file, or the AI responding to a user input or a tool call, or anything that's not AI-generated.
Yeah, exactly.
Yeah, there's actually no adjustment at all to the memory time.
So, yeah, there is the time for one pass, but actually the amount of tokens is that much larger.
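A minimal sketch of this point, under a simplified roofline-style assumption: one pass costs the larger of the time to read the weights from memory and the time to do the arithmetic, and KV-cache reads are ignored. All names and hardware numbers below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def pass_time(n_tokens: int, param_bytes: float, flops_per_token: float,
              mem_bandwidth: float, peak_flops: float) -> float:
    """Time for one forward pass over n_tokens (simplified model)."""
    # Weights are read once per pass, so this does not depend on n_tokens.
    memory_time = param_bytes / mem_bandwidth
    # Arithmetic grows linearly with the number of tokens in the pass.
    compute_time = n_tokens * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Hypothetical hardware and model numbers.
cfg = dict(param_bytes=100e9, flops_per_token=1e11,
           mem_bandwidth=1e12, peak_flops=1e15)

decode = pass_time(1, **cfg)        # a pass with one token
prefill = pass_time(4096, **cfg)    # a pass with many tokens

print(f"decode:  {decode:.4f} s per pass, {decode / 1:.6f} s per token")
print(f"prefill: {prefill:.4f} s per pass, {prefill / 4096:.6f} s per token")
```

In this toy setting the memory term is identical for both passes, while the pre-fill pass covers thousands of tokens, so its cost per token is far lower than decode's.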