Reiner Pope
Yeah, whenever we have balance points, it kind of says that you're getting it exactly right.
And so for the particular context length where the slopes match, that says I am equally memory bound and compute bound, which is a really desirable place to be.
That's right.
So that is true as modeled here.
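To make the balance point concrete, here is a minimal roofline-style sketch of a single decode step. Every number in it is an illustrative assumption (roughly H100-class bf16 compute and HBM bandwidth, a hypothetical 70B dense-attention model with grouped-query KV), not a figure from the episode; the step is compute bound when the matmul FLOPs dominate, memory bound when the weight and KV fetch dominates, and balanced where the two times match.

```python
# Assumed hardware and model numbers (illustrative only).
PEAK_FLOPS  = 1.0e15      # accelerator peak, FLOP/s
HBM_BW      = 3.3e12      # memory bandwidth, bytes/s
N_PARAMS    = 70e9        # parameter count
N_LAYERS    = 80
D_MODEL     = 8192
KV_DIM      = 1024        # assumed 8 KV heads x head_dim 128 (grouped-query)
DTYPE_BYTES = 2           # bf16

def decode_step_times(batch, context):
    """Compute time and memory-fetch time for one decode step, in seconds."""
    # Compute: ~2 FLOPs per parameter per token, plus attention FLOPs
    # (scores and weighted values) that grow linearly with context length.
    flops = batch * (2 * N_PARAMS + 4 * N_LAYERS * context * D_MODEL)
    # Memory: fetch all weights once per step, plus every sequence's KV cache,
    # which is the term that grows linearly with context length.
    kv_bytes_per_token = 2 * N_LAYERS * KV_DIM * DTYPE_BYTES
    bytes_moved = N_PARAMS * DTYPE_BYTES + batch * context * kv_bytes_per_token
    return flops / PEAK_FLOPS, bytes_moved / HBM_BW

# The step is compute bound where the first number dominates and memory bound
# where the second does; the context length where they match is the balance point.
for ctx in (256, 1_024, 4_096, 16_384):
    c, m = decode_step_times(batch=512, context=ctx)
    print(f"context={ctx:>6}: compute={c*1e3:6.1f} ms, memory-fetch={m*1e3:6.1f} ms")
```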
There's a key point here: I'm modeling the memory fetch as linear in context length.
That actually depends on model architecture.
It is true for all of the model architectures with dense attention.
Yeah.
But sparse attention actually scales much better than that.
I'm pretty excited about sparse attention.
It's hard to know what the labs are using.
DeepSeek has published a sparse attention mechanism.
I'll just put in a plug for sparse attention: some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.
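To make that scaling contrast concrete, here is a small sketch reusing the assumed KV-cache footprint from the sketch above. The square-root term is only a stand-in for the behaviour described, not DeepSeek's actual selection mechanism.

```python
import math

# Assumed KV-cache bytes per token: K and V, layers, kv_dim, bf16 (illustrative).
KV_BYTES_PER_TOKEN = 2 * 80 * 1024 * 2

def kv_fetch_dense(context):
    # Dense attention reads the whole KV cache every decode step: linear in context.
    return context * KV_BYTES_PER_TOKEN

def kv_fetch_sparse(context):
    # A mechanism that attends to roughly sqrt(context) tokens reads far less.
    return math.isqrt(context) * KV_BYTES_PER_TOKEN

for ctx in (1_000, 16_000, 256_000):
    print(f"context={ctx:>7}: dense={kv_fetch_dense(ctx)/1e9:7.2f} GB/step, "
          f"sparse~{kv_fetch_sparse(ctx)/1e9:5.2f} GB/step")
```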
OK, so far we've looked at the latency.
It's kind of hard to read off cost from this.
So if I think, what does cost mean?
To run this inference, I'm going to use the GPU for a certain number of seconds, like 1 millisecond or 20 milliseconds or something like that.
And I have to pay the rental price for that time.
So it's $2 an hour per GPU or something like that.
So that's the cost of this inference. But how much value have I gotten, how many tokens have I processed during that inference? That is the batch size. And so what we actually want to plot is the cost versus batch size, which is T over B versus batch size. This is the cost per token.
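As a minimal illustration of that arithmetic, here is a sketch with an assumed $2 per GPU-hour rental price and a hypothetical 20 ms decode step.

```python
# Cost accounting as just described, with assumed numbers: the inference
# occupies the GPU for T seconds, you pay the rental rate for that time,
# and you spread the bill over the B tokens in the batch.
GPU_PRICE_PER_HOUR = 2.00        # assumed $/GPU-hour

def cost_per_token(step_latency_s, batch_size, n_gpus=1):
    """Dollar cost per token for one decode step: (T * price) / B."""
    dollars = step_latency_s * n_gpus * GPU_PRICE_PER_HOUR / 3600
    return dollars / batch_size

# Larger batches amortize the same GPU-seconds over more tokens, though in
# practice the step latency T itself also grows with batch size.
for b in (1, 16, 256):
    print(f"batch={b:>3}: ${cost_per_token(0.020, b):.8f} per token")
```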