Reiner Pope
๐ค SpeakerAppearances Over Time
Podcast Appearances
So I'll just make that claim and move on.
So we're going to say that we have the cost of training and the cost of inference, and we want to equalize these.
We'll do pre-training only first because it's a little... well, actually, we can do all of it in general. So we'll break it down piece by piece.
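In symbols (my notation, not the speaker's), the claim is:

    C_train = C_pretrain + C_RL ≈ C_inference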
Cost of pre-training: the number of active params times the amount of pre-training data. So that's the cost of pre-training. There's a factor of six out front, which is the number of FLOPs per parameter per token. This is the famous 6ND formula: C_pretrain = 6 · N_active · D_pretrain.
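As a quick sanity check on the 6ND arithmetic, here's a minimal sketch; the model size and token count below are hypothetical, not numbers from the talk.

```python
def pretrain_flops(n_active_params: float, n_tokens: float) -> float:
    """6ND rule of thumb: roughly 6 FLOPs per active parameter per token."""
    return 6.0 * n_active_params * n_tokens

# Hypothetical example: 70B active params trained on 15T tokens.
print(f"{pretrain_flops(70e9, 15e12):.1e} FLOPs")  # ~6.3e24
```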
And then in RL, we have approximately the same thing: the same number of active parameters, but now the amount of data is the RL data. And there's this extra multiplier, which is an inefficiency multiplier. So C_RL ≈ 6 · N_active · D_RL · (inefficiency multiplier).
Well, yeah, there's that.
And then the other, perhaps even bigger, inefficiency is that RL involves a substantial amount of decode, and decode often runs at lower MFU than training.
It would be at least 2, somewhere in the range of 2 to 6. So we'll just say 2 to 6 and leave it at that.
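Putting the RL term together, here's the same back-of-envelope sketch with the inefficiency multiplier left as a free parameter; the token count and model size are illustrative assumptions, and the 2-6x range is the one given above.

```python
def rl_flops(n_active_params: float, rl_tokens: float,
             inefficiency: float) -> float:
    """RL compute: 6ND applied to the RL tokens, scaled by an
    inefficiency multiplier (~2-6x, largely because decode runs
    at lower MFU than training)."""
    return 6.0 * n_active_params * rl_tokens * inefficiency

# Hypothetical: same 70B-active model, 1T RL tokens.
low = rl_flops(70e9, 1e12, inefficiency=2.0)   # ~8.4e23 FLOPs
high = rl_flops(70e9, 1e12, inefficiency=6.0)  # ~2.5e24 FLOPs
```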
Yeah.