Reiner Pope
So I guess we want the cost per token, in fact.
Or the time per token.
Well, actually, for processing the entire batch.
So, at this cost, we have processed this many tokens, like, in the pre-fill.
Yeah.
Well, I guess, pre-fill, yeah, like, of the pass.
Yeah.
Not this prefix, but it's this cost.
Okay, let's proceed to the pass.
So the result we want to work towards is that pre-fill is compute limited and decode is memory bandwidth limited.
T... We want the cost per token, so it'll be T over some stuff.
T over length of the pass.
But then why is it cheaper?
Why does it cost more?
Yeah, yeah.
So, I mean, we're going to... It's this division by the length of the pass that actually makes it all...
So... Okay, yeah, this is going to divide out, but then we're going to get... All of this is going to divide by length of pass, and it's going to make the memory cost cheaper.
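Written out, the division might look like the following. This is only a sketch under a simple roofline assumption, where the time for one pass is the larger of its compute time and its weight-loading time. The symbols are not from the conversation and are assumed here for illustration: L is the length of the pass in tokens, P the parameter count, b the bytes per parameter, F the peak FLOP/s, and W the memory bandwidth in bytes/s.

```latex
\[
T \approx \max\!\left(\frac{2PL}{F},\; \frac{Pb}{W}\right)
\qquad\Longrightarrow\qquad
\frac{T}{L} \approx \max\!\left(\frac{2P}{F},\; \frac{Pb}{W L}\right)
\]
```

In the compute term the L divides out, so it is the same per token for any pass length. The memory term keeps a 1/L, so it gets cheaper per token as the pass gets longer.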
Length of the pass, when it's one, that is decode.
When it is bigger, that is pre-fill.
Okay, I see, I see, I see.
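To make the pre-fill versus decode contrast concrete, here is a rough numerical sketch. It assumes a simple roofline model (pass time is the larger of compute time and weight-loading time) and ignores attention and KV-cache traffic. The function names and all the hardware and model numbers are hypothetical, chosen only for illustration.

```python
def pass_times(length, n_params, peak_flops, mem_bw, bytes_per_param=2):
    """Return (compute_time, memory_time) in seconds for one forward pass.

    Assumption: about 2 FLOPs per parameter per token, and the weights
    are read from memory once per pass regardless of its length.
    """
    compute_time = 2 * n_params * length / peak_flops
    memory_time = n_params * bytes_per_param / mem_bw
    return compute_time, memory_time


def time_per_token(length, **hw):
    """T over length of the pass, with T taken as the roofline maximum."""
    compute_time, memory_time = pass_times(length, **hw)
    return max(compute_time, memory_time) / length


def bottleneck(length, **hw):
    """Which term dominates the pass: 'compute' or 'memory'."""
    compute_time, memory_time = pass_times(length, **hw)
    return "compute" if compute_time >= memory_time else "memory"


# Hypothetical setup: 70e9 parameters, 1e15 FLOP/s, 3e12 bytes/s.
HW = dict(n_params=70e9, peak_flops=1e15, mem_bw=3e12)

for length in (1, 64, 4096):
    print(length, bottleneck(length, **HW), time_per_token(length, **HW))
```

With these assumed numbers, a pass of length one (decode) is dominated by loading the weights, while a long pass (pre-fill) is dominated by compute, and the time per token falls as the pass gets longer until the compute term takes over.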