And we call that the KV cache.
So this process of attending, this single token attending to all of the history of tokens, that's attention.
It is mostly dominated by memory fetches rather than matrix multiplies.
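As a rough back-of-the-envelope for why that is (the notation here is mine, not from the talk): per decode step and per layer, attention reads about $2Ld$ K and V elements from the cache, where $L$ is the context length and $d$ is the KV width, and performs about $4Ld$ FLOPs on them, so

$$\frac{\text{FLOPs}}{\text{bytes moved}} \approx \frac{4Ld}{2\,\text{bytes} \times 2Ld} = 1\ \text{FLOP/byte}$$

at bf16 precision, which is orders of magnitude below the hundreds of FLOPs per byte of bandwidth that modern accelerators can sustain, hence memory-bound.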
So we've got the amount of memory that we're fetching shown over here.
And then that's, of course, just divided by the memory bandwidth. So the memory bytes per second.
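In symbols, a plausible reading of the memory time being described (again, notation mine): with $W$ the weight bytes, $K_L$ the KV-cache bytes per sequence at context length $L$, $B$ the batch size, and $\beta$ the memory bandwidth in bytes per second,

$$T_\text{mem} = \frac{\text{bytes fetched}}{\beta} = \frac{W + B \cdot K_L}{\beta}.$$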
So in fact, these equations here are actually enough for us to now draw some fit lines.
And so the things that we'd like to look at are sensitivity to batch size, and then also, which we'll draw separately, sensitivity to context lengths.
So we said that the big effect you can get is a trade-off between latency and cost via batch size.
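One way to make that trade-off precise, writing $T(B)$ for the time of one decode step at batch size $B$ (notation mine):

$$\text{latency} = T(B), \qquad \text{cost per token} \propto \frac{T(B)}{B}.$$

Growing $B$ amortizes the fixed weight fetch and lowers cost per token, at the price of a longer step, until the compute term takes over and the cost flattens.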
So let's draw them out.
I think there's just really two graphs we want to draw.
We'll first just draw batch size versus time here.
So when we look at the shape of this, we've got a maximum over two things: a compute term, and a sum of two memory terms.
So let's look at these terms one by one: how the compute and memory times scale with batch size, and how they show up on the graph.
So let's first look at this compute time.
This is just purely linear in batch size with no offset.
So it is some curve like this.
This is T compute.
And then on the memory side, we've got one portion that is just constant, a base offset, which is the weight fetch.
And then finally, we have this term here, the KV fetch, which grows linearly with batch size, and we'll draw that on top of the weight-fetch offset.
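Here is a minimal numeric sketch of those two curves. Every number in it is invented purely to illustrate the shapes; none of them are the speaker's figures, and they don't describe any particular chip or model.

```python
import numpy as np

# All numbers below are invented for illustration only.
FLOPS = 300e12        # peak matmul throughput, FLOP/s
MEM_BW = 1.5e12       # HBM bandwidth, bytes/s
PARAMS = 50e9         # parameter count
WEIGHT_BYTES = 2 * PARAMS                   # bf16 weights
KV_BYTES_PER_SEQ = 2 * 2 * 8192 * 40 * 128  # K,V * bf16 * ctx * layers * KV width

def step_time(batch):
    """One decode step: the max of the compute time and the memory time."""
    # Compute time: ~2 FLOPs per parameter per token, linear in batch, no offset.
    t_compute = 2 * PARAMS * batch / FLOPS
    # Memory time: weights are fetched once per step (the constant base offset),
    # and the KV cache is fetched once per sequence (the term linear in batch).
    t_memory = (WEIGHT_BYTES + KV_BYTES_PER_SEQ * batch) / MEM_BW
    return np.maximum(t_compute, t_memory)

for b in [1, 8, 64, 512]:
    t = step_time(b)
    print(f"batch {b:4d}: step time {t*1e3:8.2f} ms, cost {t/b*1e6:8.1f} us/token")
```

With these made-up numbers, the step time is nearly flat at small batch sizes, since the constant weight fetch dominates, so cost per token falls steeply; once the compute line crosses the memory line, the step time grows linearly and the cost per token flattens out.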