Reiner Pope
And so what is the minimum work you can do per batch after amortizing everything else away?
You can just solve for that, actually.
And it's not even particularly sensitive to model architecture.
So let's go ahead and do that.
So what we're talking about is we're going to say when the memory time is equal to the compute time.
That's what that question is.
For now, because we're focused on what the batch size is, and really the question is when the weights are amortized over the multiplies, I'm going to focus on comparing the weight-fetch time to the weight-multiply time.
I'm going to disregard the KV-fetch term just to simplify the analysis, so we can get a clean answer out.
So we're going to equate these two times.
So writing that out, we get the number of total parameters over the memory bandwidth is equal to the batch size times the number of active parameters divided by the compute performance.
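Written out as an equation (a sketch of the balance being described, where N_total is the total parameter count, N_active the active parameter count, B the batch size, BW the memory bandwidth, and C the compute performance):

```latex
\underbrace{\frac{N_{\text{total}}}{BW}}_{\text{weight-fetch time}}
\;=\;
\underbrace{\frac{B \cdot N_{\text{active}}}{C}}_{\text{weight-multiply time}}
```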
So looking over here, everything on the top, these are model parameters.
Everything on the bottom, these are hardware parameters.
It turns out to be nice to rearrange them such that we have the hardware parameters on one side.
So this is equivalent to the compute performance divided by the memory bandwidth being equal to the batch size times the number of active parameters divided by the number of total parameters.
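In symbols (same sketch as before: B is batch size, N_total and N_active the total and active parameter counts, BW the memory bandwidth, C the compute performance), the rearrangement puts the hardware terms on the left, and solving for B gives the critical batch size:

```latex
\frac{C}{BW} \;=\; \frac{B \cdot N_{\text{active}}}{N_{\text{total}}}
\qquad\Longrightarrow\qquad
B \;=\; \frac{C}{BW} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
```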
So this is a hardware parameter.
Actually, this ends up being a dimensionless constant.
If you look in terms of flops, what are the dimensions of this?
This is multiplies per second.
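The solve-for-the-batch-size step can be sketched in a few lines. The hardware numbers below are purely illustrative assumptions (not measurements of any particular chip), and compute is measured in multiplies per second and bandwidth in parameters per second so the ratio comes out dimensionless, as described above:

```python
def critical_batch_size(mults_per_s, params_per_s, n_total, n_active):
    """Batch size B at which weight-fetch time equals weight-multiply time:
    n_total / params_per_s == B * n_active / mults_per_s."""
    return (mults_per_s / params_per_s) * (n_total / n_active)

# Dense model: every parameter is active, so the model terms cancel and
# B* is just the hardware ratio -- insensitive to model architecture.
b_dense = critical_batch_size(mults_per_s=500e12, params_per_s=1.5e12,
                              n_total=70e9, n_active=70e9)

# Sparse (MoE-style) model with 1/8 of parameters active per token:
# the critical batch size grows by the sparsity factor, 8x here.
b_sparse = critical_batch_size(mults_per_s=500e12, params_per_s=1.5e12,
                               n_total=70e9, n_active=70e9 / 8)
```

Note that for the dense case the parameter counts cancel entirely, which matches the earlier point that the answer isn't particularly sensitive to model architecture.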