Reiner Pope
We have to imagine dividing each of these three curves by B, i.e., multiplying by the reciprocal 1/B.
And so what we end up with there is: the compute curve was linear, we divide by B, and that makes it a constant here.
This is T compute.
The KV fetch was linear, now it becomes a constant as well.
And then the weight fetch was constant, and now we're dividing by B, and so it becomes this hyperbola.
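The per-token accounting described here can be sketched numerically. The constants below are illustrative placeholders, not real hardware numbers; the point is only the shape of each curve after dividing by B:

```python
# Hypothetical per-batch times as a function of batch size B.
# Constants are made up for illustration, not measured from hardware.
def t_compute(B):   return 2.0 * B   # compute: linear in B
def t_kv_fetch(B):  return 1.0 * B   # KV fetch: linear in B (one KV cache per batch element)
def t_weights(B):   return 8.0       # weight fetch: constant (weights read once per batch)

# Per-token times: divide each per-batch time by B.
for B in [1, 2, 4, 8, 16]:
    per_tok_compute = t_compute(B) / B    # stays constant at 2.0
    per_tok_kv      = t_kv_fetch(B) / B   # stays constant at 1.0
    per_tok_weights = t_weights(B) / B    # hyperbola: 8.0 / B
    print(B, per_tok_compute, per_tok_kv, per_tok_weights)
```

Only the weight-fetch term changes shape: linear terms divided by B become flat, while the constant term becomes the 1/B hyperbola.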
And so again, we're going to compute the max of the sum.
So the sum of these two terms shifts the curve up.
The sum of the KV fetch and the weight fetch gives us a higher hyperbola that's like this.
And then we're going to take the max with the compute here.
So we end up with this being the overall shape that we care about.
So again, we see some limiting behavior.
The cost initially starts very high at a batch size of one; it almost goes to infinity.
That's because we've got so many weight fetches which are not amortized over a large batch size.
But then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small.
And eventually, the compute time ends up driving the cost.
So there is a limiting lower bound on cost,
which is the compute time here.
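A minimal numeric sketch of this limiting behavior, assuming illustrative per-token times of 2.0 for compute, 1.0 for KV fetch, and 8.0/B for weight fetch (made-up constants, not real measurements):

```python
def per_token_cost(B):
    # Illustrative constants, not real hardware numbers.
    compute  = 2.0         # per-token compute time: constant in B
    kv_fetch = 1.0         # per-token KV fetch: constant in B
    weights  = 8.0 / B     # per-token weight fetch: amortized over the batch
    # Memory traffic (KV + weights) can overlap with compute, so the
    # per-token time is the max of the two, not their sum.
    return max(compute, kv_fetch + weights)

for B in [1, 2, 8, 64, 1024]:
    print(B, per_token_cost(B))
# At B = 1 the unamortized weight fetches dominate (cost 9.0); as B grows,
# the cost falls toward the compute floor of 2.0.
```

With these constants the crossover happens at B = 8, where the memory term (1.0 + 8.0/8 = 2.0) drops to the compute floor; beyond that, compute drives the cost.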
Yeah, they're unique per batch.
The compute is also unique per batch.