Reiner Pope
So there are really two things I need to do on the compute side.
I need to multiply by all of the active parameters, and then I need to do some work on the attention.
So for multiplying by all the active parameters: I have a certain batch size that I'm running, times the number of active parameters in my model, and then I'm just going to divide that by the compute throughput, which is the FLOPS of the chip.
So this is a hardware constant.
So this actually accounts for all of the compute time for all of the weight matrix multiplies.
There's a little caveat here.
We've sort of ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this.
So we'll ignore this.
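As a rough sketch of what's being described here (my notation and hardware numbers, not the speaker's), the compute side is just batch size times active parameters divided by the chip's throughput; a common convention counts 2 FLOPs per parameter per token, one for the multiply and one for the add.

```python
# Minimal back-of-envelope sketch of the compute-time estimate.
# The hardware constant below is an assumed example, not a quoted figure.

def compute_time_s(batch_size, active_params, chip_flops):
    """Time for the weight matrix multiplies in one forward step.

    Uses the common convention of 2 FLOPs (multiply + add) per active
    parameter per token in the batch.
    """
    total_flops = 2 * batch_size * active_params
    return total_flops / chip_flops

# Assumed example: ~1e15 FLOP/s of dense low-precision throughput,
# roughly the order of a modern datacenter accelerator.
CHIP_FLOPS = 1e15
```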
Yeah, so I can motivate the batch at least a little bit.
So, I mean, we will see exactly why batching is such a favorable optimization.
But what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together.
And we'll be able to see that quite explicitly.
And then the number of active parameters, this is saying: if I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters out of roughly 700 billion total parameters.
So we're focusing on just the ones that are active for a single token.
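Plugging those numbers into the sketch above (with the assumed 1e15 FLOP/s chip): at batch size 1, that's 2 × 37e9 ≈ 74 GFLOPs per token, or only about 74 microseconds of compute per step.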
OK, so we modeled compute performance.
I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much.
And maybe there'll be some terms we ignored.
On the memory side, what do we need to do with memory?
We need to fetch all of the weights.
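A matching sketch for the memory side, under assumed numbers (the precision, bandwidth, and parameter count here are illustrative, not from the talk): every weight has to be streamed from memory once per forward step, and the batch size does not appear in this term, which is the effect foreshadowed above that makes batching so favorable.

```python
# Minimal sketch of the memory side: all weights are fetched from HBM
# once per forward step, regardless of batch size. Numbers are assumed.

def memory_time_s(total_params, bytes_per_param, memory_bandwidth):
    """Time to stream all weights from memory for one forward step."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / memory_bandwidth

# Assumed example: ~700e9 weights at 1 byte each (8-bit weights) and
# ~3e12 bytes/s of HBM bandwidth.
print(memory_time_s(700e9, 1, 3e12))  # ~0.23 s per step, independent of batch
```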