Dwarkesh Patel
Podcast Appearances
Let me make the question more concrete.
How much more than Chinchilla-optimal are models overtrained?
And has that changed as a result of RL generation?
Which is the fact that you're not training on all your rollouts.
Okay, so if you're doing a backward pass on every single generation in RL, it would be 6ND.
Yeah, so this could be a smaller number, right?
I think the way I said it was super garbled.
Just for the audience, maybe.
Forward plus backwards per parameter is six.
Forward alone is two.
That's why RL, where you're definitely going to generate all the trajectories but might or might not train on all of them, is two to six.
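For readers following the arithmetic, here is a minimal Python sketch of the "two to six" range. It assumes the usual rule of thumb quoted above (about 2 FLOPs per parameter per token for a forward pass, about 6 for forward plus backward) and, as a simplification, that the forward pass from generation is reused when a trajectory is trained on. The function names and the `train_fraction` parameter are hypothetical, introduced only for illustration.

```python
import math

FORWARD_FLOPS = 2.0   # approx. FLOPs per parameter per token, forward only
BACKWARD_FLOPS = 4.0  # approx. extra FLOPs per parameter per token for the backward pass


def rl_flops_per_param_token(train_fraction: float) -> float:
    """Approximate FLOPs per parameter per generated token in RL.

    train_fraction is the (hypothetical) share of generated tokens that
    also receive a backward pass. Every token pays the forward cost;
    only the trained share pays the backward cost.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    return FORWARD_FLOPS + BACKWARD_FLOPS * train_fraction


def rl_total_flops(n_params: float, n_tokens: float, train_fraction: float) -> float:
    """Total approximate compute: coefficient * N * D."""
    return rl_flops_per_param_token(train_fraction) * n_params * n_tokens


if __name__ == "__main__":
    for frac in (0.0, 0.5, 1.0):
        coeff = rl_flops_per_param_token(frac)
        print(f"train_fraction={frac:.1f} -> {coeff:.1f} * N * D")
```

With `train_fraction=0.0` the coefficient is 2 (generation only), and with `train_fraction=1.0` it is 6 (a backward pass on every generation), which matches the 6ND case mentioned earlier.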
Yes.
Yeah.
And inference would be 50%.
If both of them are 1 in 10, that kind of implies that there's never a backward pass on RL?
So this is like 1.5 and this is one.
Billions of dollars of the compute just flowed the other direction.
Right.
But then, so it looks... Sorry, I'm making a basic algebra mistake.
It seems like there should be less RL tokens than pre-training tokens?
This is quite interesting.