Sholto Douglas
Podcast Appearances
It's just a matter of expending enough compute and having the right algorithm, basically.
You know the parable about when you choose to launch a space mission?
How you should wait and go further up the tech tree, because if you launch later on, your ship will go faster and this kind of stuff?
I think it's quite similar to that.
You want to be sure that you've algorithmically got the right thing.
And then, when you bet and do the large compute spend on the run, it'll actually pay off.
You'll have the right compute efficiencies and this kind of stuff.
And I think RL is slightly different to pre-training in this regard, where RL can be a more iterative thing.
You're progressively adding capabilities to the base model.
Pre-training, in many respects, is different: if you're halfway through a run and you've messed it up, then you've really messed it up.
But I think that's the main reason why: people are still figuring out exactly what they want to do.
I mean, from o1 to o3, OpenAI put in their blog post that it was a 10x compute multiplier over o1.
So clearly they bet on one level of compute, and they were like, OK, this seems good.
Let's actually release it.
Let's get it out there.
And then they spent the next few months increasing the amount of compute that they spent on that.
And I expect, as everyone does, that everyone else is scaling up RL right now.
So I basically don't expect that to be true for very long.
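To make the contrast described above concrete, RL as iterative rounds layered onto a base model versus pre-training as one large run that is hard to recover if it goes wrong, here is a toy Python sketch. It is an illustration only, not anyone's actual training pipeline; the names `Checkpoint`, `rl_round`, and `evaluate` are hypothetical stand-ins.

```python
# Toy sketch: RL as a series of resumable rounds, each starting from the last checkpoint.
# If one round goes badly, only that round's compute is lost, not the whole run.
import random
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """Stand-in for model weights plus the capabilities trained so far."""
    capabilities: list = field(default_factory=list)
    score: float = 0.0

def rl_round(ckpt: Checkpoint, capability: str, compute: int) -> Checkpoint:
    """One RL round: resume from the previous checkpoint and add one capability."""
    gain = sum(random.random() for _ in range(compute)) / compute  # toy stand-in for learning
    return Checkpoint(ckpt.capabilities + [capability], ckpt.score + gain)

def evaluate(ckpt: Checkpoint) -> float:
    return ckpt.score  # placeholder for real evals

ckpt = Checkpoint()  # the pre-trained base model
for capability in ["math", "coding", "agentic tool use"]:
    # Try the recipe cheaply first; commit the big compute spend only once it looks right.
    trial = rl_round(ckpt, capability, compute=10)
    if evaluate(trial) > evaluate(ckpt):
        ckpt = rl_round(ckpt, capability, compute=1000)  # the scaled-up run
print(ckpt.capabilities)
```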
You literally do have a monkey, and it's making Shakespeare.
I was just going to say, like, you do need to be able to get reward sometimes in order to learn.
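That last point, that a policy has to receive nonzero reward at least occasionally in order to learn, can be seen in a minimal REINFORCE-style estimate. This is my illustration, not something from the conversation: when every sampled action scores zero, the gradient estimate is exactly zero and the policy never moves.

```python
# Minimal sketch: a REINFORCE-style gradient estimate for a toy softmax policy.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)  # toy softmax policy over 4 actions

def policy_gradient_estimate(logits, reward_fn, n=256):
    """Average of reward * grad log pi(action) over n sampled actions."""
    probs = np.exp(logits) / np.exp(logits).sum()
    grad = np.zeros_like(logits)
    for _ in range(n):
        a = rng.choice(len(logits), p=probs)
        # Gradient of log softmax at the sampled action: one-hot(a) - probs.
        grad += reward_fn(a) * (np.eye(len(logits))[a] - probs)
    return grad / n

never_solved = lambda a: 0.0                      # task too hard: reward is never observed
sometimes_solved = lambda a: 1.0 if a == 2 else 0.0

print(policy_gradient_estimate(logits, never_solved))      # all zeros -> no learning signal
print(policy_gradient_estimate(logits, sometimes_solved))  # nonzero -> pushes toward action 2
```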