Sholto Douglas
👤 PersonAppearances Over Time
Podcast Appearances
And so maybe the way I would define it now is the thing that's holding them back is if you can give it a good feedback loop for the thing that you want it to do, then it's pretty good at it.
If you can't, then they struggle a bit.
Yes.
So the big thing that really worked over the last year is –
Maybe broadly, the domain is called RL from verifiable rewards or something like this, where a clean reward signal.
So the initial unhoppling of language models was RL from human feedback, where typically it was something like pairwise feedback or something like this, and the outputs of the models became closer and closer to things that humans wanted.
But this doesn't necessarily improve their performance at any level.
like difficulty of problem domain, right?
Particularly as humans are actually quite bad judges of what a better answer is.
Humans have things like length biases and so forth.
So you need a signal of whether the model was correct in its output that is...
that is like quite true, let's say.
And so things like the correct answer to a math problem or unit tests, parsing, this kind of stuff.
These are the examples of a reward signal that's very clean, but even these can be hacked, by the way.
Like even unit tests, the models find ways around it to like,
hack in particular values and hard code values of unit tests if they can figure out what the actual test is doing.
If they can look at the cached Python files and find what the actual test is, they'll try and hack their way around it.
So, these aren't perfect, but they're much closer.
In part because software engineering is...
It's very verifiable.