Sholto Douglas
Yes, I mean, I agree.
Like, the case where you end up with, like, two national projects facing off against each other is dramatically worse.
Right.
Like, we don't want to live in that world.
Much better if there's, like... it stays a free market, so to speak.
Yeah, yeah, yeah.
I mean, like a continuous distribution of this stuff.
One important mental model for thinking about RL is, I think, that as the task gets more complex, there is some respect in which longer-horizon tasks, if you can do them, if you can get that reward ever, are easier to judge.
So again, let's come back to that: can you make money on the internet?
That's an incredibly easy reward signal to judge.
But to do that, there's a whole hierarchy of complex behavior.
So if you could pre-train up to the easy-to-judge reward signals: does your website work?
Does it go down?
Do people like it?
There's all these reward signals that we can respond to because we can progress through these long enough trajectories to actually get to interesting things.
If you're stuck in this regime where you need a reward signal every five tokens, it's, like, a way more painful and long process. But if you could pre-train on every screen in America, then the RL tasks that you can design are probably very different from if you could only take the existing internet as it is today. And so how much of that you get access to changes the mix.
Interesting.
I mean, that's definitely one of the big complexities, right?