Reiner Pope
So... the number of inference tokens you have is just a function of: I've got hundreds of millions of tokens per second, times how long my model is deployed for, say two months before I shift to the next version. That should determine the number of tokens in RL and pre-training. And I guess we didn't do the equivalence between pre-training and RL, so we'll do that here.
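As a rough order-of-magnitude check (reading "hundreds of millions" as $3 \times 10^8$ tokens per second and two months as about $5 \times 10^6$ seconds; the figures are the speaker's ballpark, the rounding is mine):

$$D_{\text{inf}} \approx 3 \times 10^{8}\ \tfrac{\text{tokens}}{\text{s}} \times 5 \times 10^{6}\ \text{s} \approx 1.5 \times 10^{15}\ \text{tokens}.$$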
Data in pre-training should be equal to, like, 2 over 10 times data in RL, for them to be cost-equivalent. Sorry, I got this one-over backwards: we pay more cost when it's inefficient, so this needs to be one over that. Tracing this back forward, this thing ends up actually being as written here.
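A hedged reconstruction of the step being corrected (the symbols, and reading the whiteboard ratio as 2/10, are inferences, not something visible here): if a pre-training token costs $c_{\text{pt}}$ and an RL token costs $c_{\text{RL}}$, then spending the same amount on both stages means

$$c_{\text{pt}} D_{\text{pt}} = c_{\text{RL}} D_{\text{RL}} \quad\Rightarrow\quad D_{\text{pt}} = \frac{c_{\text{RL}}}{c_{\text{pt}}}\, D_{\text{RL}},$$

so if RL is the less efficient stage, with $c_{\text{RL}}/c_{\text{pt}} \approx 10/2$, the multiplier on $D_{\text{RL}}$ is $1/(2/10) = 5$, not $2/10$. That is the "one over" being fixed.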
I think if you do it with a spreadsheet and actually model it out, you might notice where the money is going down the drain.
All of these end up being close as modeled here.
This 30% may have been a little bit too generous.
Let's say something like 1.5 here and leave this as a 1 here.
I think at this point you can almost read it off.
The number of inference tokens should be about the same as the number of pre-training tokens, which should be about the same as the number of RL tokens, to within factors that we're not able to reason about.
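A minimal spreadsheet-style sketch of that read-off, in Python. The serving rate and deployment window are the speaker's ballpark figures; the per-token cost multipliers (1.5 for pre-training, 5 for RL) are illustrative assumptions standing in for the constants on the whiteboard:

```python
# Equalize spend across inference, pre-training, and RL, then
# back out the token budgets. All constants are rough assumptions.

TOKENS_PER_SEC = 3e8          # serving rate ("hundreds of millions" per second)
DEPLOY_SECONDS = 60 * 86_400  # ~two months before the next model version

COST_PER_INF_TOKEN = 1.0      # normalize inference cost to 1 per token
COST_PER_PT_TOKEN = 1.5       # assumed relative cost of a pre-training token
COST_PER_RL_TOKEN = 5.0       # assumed: RL ~5x less efficient per token

inference_tokens = TOKENS_PER_SEC * DEPLOY_SECONDS
budget = inference_tokens * COST_PER_INF_TOKEN  # total inference spend

# Spend the same budget on each stage and back out the token counts.
pretrain_tokens = budget / COST_PER_PT_TOKEN
rl_tokens = budget / COST_PER_RL_TOKEN

for name, n in [("inference", inference_tokens),
                ("pre-training", pretrain_tokens),
                ("RL", rl_tokens)]:
    print(f"{name:>12}: {n:.2e} tokens")
# All three land within an order of magnitude of ~1e15 tokens.
```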
Yeah, that's in general right. Because RL is less efficient in terms of machine time, if you're trying to equalize RL and pre-training, then you should have fewer RL tokens, not the same wall time.
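To spell that out (these symbols are my own shorthand, not from the discussion): with throughput $r$ in tokens per machine-second, the machine time for $D$ tokens is $T = D/r$, so holding machine time fixed across the two stages gives

$$T_{\text{RL}} = T_{\text{pt}} \;\Rightarrow\; D_{\text{RL}} = D_{\text{pt}} \cdot \frac{r_{\text{RL}}}{r_{\text{pt}}},$$

which is fewer RL tokens whenever RL throughput $r_{\text{RL}}$ is below pre-training throughput $r_{\text{pt}}$.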
Equalizing in terms of data?