Reiner Pope
which is linear in batch size.
So it looks like that.
So the total latency is the sum of these two memory times, maxed with the compute time.
So let's at least first draw the sum.
So the two memory times, summed together, end up looking like this sloped curve.
And then we get the overall latency, which I'll draw a little thicker here, as the maximum of these two curves.
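The max-of-two-curves model above can be sketched in a few lines of Python. The function name and all constants here are illustrative assumptions, not the speaker's actual numbers:

```python
def latency_s(batch, param_bytes, kv_bytes_per_seq, mem_bw, flops, params):
    """Per-decode-step latency as the max of memory time and compute time."""
    # Memory time: stream all weights once (batch-independent), plus the
    # KV cache for every sequence in the batch (linear in batch size).
    mem_time = (param_bytes + batch * kv_bytes_per_seq) / mem_bw
    # Compute time: ~2 FLOPs per parameter per token, linear in batch size.
    compute_time = 2 * params * batch / flops
    # Overall latency is the max of the two curves.
    return max(mem_time, compute_time)

# Illustrative numbers: a 70B-parameter model in bf16 (140e9 bytes of
# weights) on a chip with ~3.35e12 B/s memory bandwidth and ~1e15 FLOP/s.
print(latency_s(1, 140e9, 1e8, 3.35e12, 1e15, 70e9))
```

At small batch the weight-read term dominates and the curve is nearly flat; as the batch grows, the KV-cache and compute terms take over.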
Make sense?
Okay, so what does this mean actually?
So this is a latency plot.
So if I grow my batch size, I get initially some not very strong dependence on batch size.
And so there's a lower bound on latency here, a latency lower bound.
So this already partially answers the question.
For a given hardware configuration, and we can talk about varying the hardware configuration later, there is a lower bound on latency, which is simply the time I need to read all of my parameters from memory into the chips.
And that takes a certain amount of time.
If I use all of my memory bandwidth, I can't do any better than that.
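As a worked example of that floor (all numbers are assumptions chosen only to make the arithmetic concrete): streaming a 70B-parameter bf16 model through roughly H100-class memory bandwidth gives a floor of about 42 ms per decode step.

```python
param_count = 70e9       # hypothetical 70B-parameter model
bytes_per_param = 2      # bf16 weights
mem_bw = 3.35e12         # bytes/s, roughly H100-class HBM bandwidth
# Latency floor: every decode step must read all weights from memory once.
lower_bound_s = param_count * bytes_per_param / mem_bw
print(f"{lower_bound_s * 1e3:.1f} ms")  # ~42 ms per step
```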
Yeah, this is really sensitive to the context length.
So I think we should come back and explore this.
As you grow the context length, the KV fetch time will go up and up.
And so that'll cause a transition from compute limited to memory limited.
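That transition can be sketched numerically, again with made-up numbers: the KV-cache read is linear in context length, so past some crossover context the memory time overtakes the compute time.

```python
def kv_fetch_time_s(batch, context_len, kv_bytes_per_token, mem_bw):
    # Each decode step re-reads the whole KV cache for every sequence.
    return batch * context_len * kv_bytes_per_token / mem_bw

def compute_time_s(batch, params, flops):
    # ~2 FLOPs per parameter per generated token.
    return 2 * params * batch / flops

# Hypothetical setup: batch 256, 70B params, 1e15 FLOP/s of compute,
# 3.35e12 B/s of memory bandwidth, 100 KB of KV cache per token.
batch, params, flops, mem_bw, kv_per_tok = 256, 70e9, 1e15, 3.35e12, 1e5
for ctx in (1024, 8192, 65536):
    kv_t = kv_fetch_time_s(batch, ctx, kv_per_tok, mem_bw)
    limited = "memory-limited" if kv_t > compute_time_s(batch, params, flops) \
        else "compute-limited"
    print(ctx, limited)
```

With these particular numbers the crossover falls between a 1K and an 8K context; the point is only that such a crossover always exists somewhere.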