Reiner Pope
And then you think, how might you put a pricing structure on top of that?
You would like to ensure that no matter what the context length is, you are still profitable.
Interesting.
So we've got a two-tier pricing structure. Maybe we've got something that looks like this: one price up to some context length, and another beyond it. Fascinating. So I think it says something: given that the bump is at 200k, it probably means that this is somewhat aligned with this crossover point. Maybe not exactly aligned, but fascinating.
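The two-tier structure being described can be sketched as a simple step function. This is a hypothetical illustration: the 200k threshold is the bump discussed above, but the dollar rates are made-up placeholders, not real prices.

```python
# Hypothetical two-tier pricing: one per-token rate up to a context-length
# threshold, a higher rate beyond it. The 200k threshold matches the bump
# discussed; the rates themselves are invented for illustration only.
def price_per_million_tokens(context_len: int,
                             threshold: int = 200_000,
                             low_rate: float = 3.0,
                             high_rate: float = 6.0) -> float:
    """Return the assumed input price ($ per 1M tokens) for a given context length."""
    return low_rate if context_len <= threshold else high_rate

# Requests below the threshold pay the base rate; longer contexts pay more.
print(price_per_million_tokens(100_000))  # 3.0
print(price_per_million_tokens(500_000))  # 6.0
```

The point of the threshold is exactly the one made above: if serving long contexts crosses from compute-bound to memory-bound around 200k tokens, a single flat rate cannot stay profitable on both sides of that crossover.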
So we can actually probably even complete that calculation, just to see where it lands.
We can solve for the number of bytes per token if we sort of make some assumptions about the number of active parameters.
Solving for the number of bytes per token: we're going to assume the point where we equalize the memory time and the compute time is, let's say, 200k tokens. So we equalize these two. We're also going to assume that the batch size is large enough that the memory time spent on weights is negligible, so we'll forget about that and focus on the actual memory time spent on KV cache.
That ends up saying, copying this term over: batch times context length times bytes per token over memory bandwidth is going to be equal to number of activated params over FLOPS.
And then we're going to solve for bytes per token.
batch size was missing here.
Shows up here, and then it cancels out by the time we get to here.
And I dropped the len context.
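The balance being set up, with the batch size written in on both sides so that it visibly cancels, can be written as follows (the symbol names are mine, not from the talk):

```latex
\frac{B \cdot L_{\text{ctx}} \cdot \text{bytes/token}}{\text{BW}_{\text{mem}}}
  = \frac{B \cdot N_{\text{active}}}{\text{FLOPS}}
\quad\Longrightarrow\quad
\text{bytes/token}
  = \frac{N_{\text{active}}}{L_{\text{ctx}}} \cdot \frac{\text{BW}_{\text{mem}}}{\text{FLOPS}}
```

Here $B$ is the batch size, $L_{\text{ctx}}$ the context length, $N_{\text{active}}$ the number of activated parameters, and $\text{BW}_{\text{mem}}$ the memory bandwidth. Note that after solving, the context length ends up in the denominator.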
So we can plug in numbers.
This number, this is, well, is the reciprocal of the number that we saw before?
Yeah, this is like one over 300, which is reasonably stable across many different hardware platforms.
We conjecturally said that maybe the number of activated parameters is like 100 billion.
And length of the context we said was 200k.
Something is wrong here.
The length of the context should be on the denominator, not the numerator.
That is plausible actually.
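With the context length moved to the denominator, the numbers quoted above can be plugged in directly. This is a sketch using the discussion's conjectured values: a memory-bandwidth-to-FLOPS ratio of roughly 1/300 and 100 billion activated parameters.

```python
# Conjectured inputs from the discussion above.
bw_over_flops = 1 / 300      # memory bandwidth / FLOPS, roughly stable across hardware
n_active_params = 100e9      # assumed number of activated parameters
len_context = 200_000        # crossover context length, in tokens

# bytes/token = (N_active / L_ctx) * (BW_mem / FLOPS), context length in the denominator
bytes_per_token = (n_active_params / len_context) * bw_over_flops
print(f"{bytes_per_token:.0f} bytes per token")  # about 1667, i.e. ~1.7 KB of KV cache per token
```

Around 1.7 KB of KV cache per token is indeed a plausible figure for a large model, which is consistent with the "that is plausible" conclusion above.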