Sholto Douglas
And that's the complexity, in some respects.
In the Alpha variants, or maybe you were about to say this, one player always wins.
So you always get a reward signal one way or the other.
But in the kinds of things we're talking about, you need to actually succeed at your task sometimes.
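A minimal toy sketch of the contrast being drawn here (illustrative only, names hypothetical and not from the transcript): in a two-player zero-sum game every episode pays out +1 or -1, whereas in a task-success setup the reward only arrives when the policy actually solves the task.

```python
import random

def game_reward(win_prob: float) -> int:
    """Self-play game: someone always wins, so every episode yields +1 or -1."""
    return 1 if random.random() < win_prob else -1

def task_reward(success_prob: float) -> int:
    """Task-based RL: reward is 1 only on success, 0 otherwise, so a weak
    policy can go many episodes with no learning signal at all."""
    return 1 if random.random() < success_prob else 0

# Even a 50/50 policy gets an informative reward every game,
# while a 5%-success policy on hard tasks mostly sees zeros.
print([game_reward(0.5) for _ in range(10)])
print([task_reward(0.05) for _ in range(10)])
```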
So language models, luckily, have this wonderful prior over the tasks that we care about.
If you look at the old papers from 2017, which isn't that old, the learning curves always look flat, flat, flat, flat, flat as the agents are figuring out the basic mechanics of the world.
And then there's this spike up as they learn to exploit easy rewards.
And then it's almost like a sigmoid in some respects.
And then it continues on indefinitely as it just learns to absolutely maximize the game.
And I think the LLM curves look a bit different in that there isn't that dead zone at the beginning.
Because they already know how to solve some of the basic tasks.
And so you get this initial spike.
And that's what people are talking about when they say, oh, you can learn from one example.
That one example is just teaching you to pull out behaviors like backtracking and to format your answer correctly, the kind of stuff that lets you get some reward initially at tasks, conditional on your pre-training knowledge.
And then the rest probably is you learning more and more complex stuff.
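To illustrate the two curve shapes being described, here is a toy sketch (illustrative functions only, not real training data): the classic RL curve sits flat through a dead zone and then follows a sigmoid, while the LLM RL curve starts well above zero because the pre-training prior already solves some basic tasks, so the spike comes immediately.

```python
import math

def classic_rl_curve(step: int, dead_zone: int = 500, scale: float = 0.01) -> float:
    # Flat while the agent figures out basic mechanics, then a sigmoid
    # once it starts exploiting easy rewards.
    return 1.0 / (1.0 + math.exp(-scale * (step - dead_zone)))

def llm_rl_curve(step: int, prior: float = 0.3, scale: float = 0.005) -> float:
    # Starts at the pre-training prior (already solving some tasks),
    # spikes early, then keeps climbing on harder behaviour.
    return prior + (1.0 - prior) * (2.0 / (1.0 + math.exp(-scale * step)) - 1.0)

for step in (0, 250, 500, 1000, 2000):
    print(step, round(classic_rl_curve(step), 2), round(llm_rl_curve(step), 2))
```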
Yeah, it's like off the curve.
Totally, yeah.
It's like a tug in the grain.
Only if you get feedback.
Yeah.