Nick Heiner
So the first is supervised fine-tuning, which is basically teaching by demonstration.
So the analogy here is you're learning to golf and you do it by watching a thousand hours of golf on YouTube.
And then you just try to figure out what they're doing.
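To make the "teaching by demonstration" idea concrete, here is a minimal sketch in Python. It is a toy, not how a real model is fine-tuned: the "demonstrations" are made-up (distance, swing power) pairs, the 200-yards-per-unit-power rule is an assumption for illustration, and the learner is a one-parameter least-squares fit.

```python
import random

def make_demonstrations(n=1000, seed=0):
    # Hypothetical "YouTube footage": (target distance in yards, swing power used).
    # Assumes the experts follow power = distance / 200, plus a little noise.
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        distance = rng.uniform(50.0, 250.0)
        power = distance / 200.0 + rng.gauss(0, 0.01)
        demos.append((distance, power))
    return demos

def fit_by_imitation(demos):
    # Closed-form least squares for power = w * distance.
    # The learner only copies what the demonstrations did; it never hits a ball.
    sum_xy = sum(d * p for d, p in demos)
    sum_xx = sum(d * d for d, _ in demos)
    return sum_xy / sum_xx

w = fit_by_imitation(make_demonstrations())
print("imitated power for a 150-yard shot:", round(w * 150.0, 3))
```

The point of the sketch is that the learner gets labeled examples of the right behavior up front and simply fits them.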
Then there's reinforcement learning from human feedback, which is when you golf with an instructor: you're at the driving range, you take two shots, and the coach tells you, OK, the first one was better.
And they don't necessarily even tell you what was better about it.
They just tell you one was better than the other.
And you sort of try slightly different things every time, and you start to converge on what the best thing to do is.
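The coach-at-the-driving-range loop can be sketched in a few lines of Python. This is a toy illustration of learning from pairwise preferences only, not a real RLHF pipeline (which typically trains a reward model and then optimizes a policy against it). The coach function, the ideal swing value of 0.7, and the step sizes are all assumptions made up for the example.

```python
import random

def coach_prefers_first(shot_a, shot_b, ideal=0.7):
    # Hypothetical coach: says only which shot was better, never why.
    return abs(shot_a - ideal) <= abs(shot_b - ideal)

def learn_from_comparisons(rounds=2000, seed=0):
    rng = random.Random(seed)
    swing = 0.0  # the learner's current habit
    for _ in range(rounds):
        # Take two slightly different shots around the current habit.
        a = swing + rng.gauss(0, 0.05)
        b = swing + rng.gauss(0, 0.05)
        winner = a if coach_prefers_first(a, b) else b
        # Nudge the habit toward whichever shot the coach preferred.
        swing += 0.1 * (winner - swing)
    return swing

print("swing after coaching:", round(learn_from_comparisons(), 2))
```

The learner never sees the ideal value directly; it only hears "this one was better" and still drifts toward it.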
And then reinforcement learning environments take it a step further.
So at the driving range you're limited by the availability of the coach, which, to say what it actually is, means you have humans looking at two responses from a model and choosing, you know, thumbs up, thumbs down.
But that requires humans, right?
Like you have to spend millions of hours to do that.
With a reinforcement learning environment, you're sent out on the golf course by yourself.
And you get feedback from the environment, like, okay, the ball went close to the target.
Right.
And in that way, you're able, again, to sort of self-teach in a sense, because you keep trying different things and then you keep getting that feedback of what worked and what didn't.
And yeah, you do that for a million hours and then all of a sudden you're a world-class golfer.
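Here is a minimal Python sketch of that "alone on the golf course" loop. It is a toy under stated assumptions: the environment function, the 200-yards-per-unit-power rule, and the 150-yard target are hypothetical, and the learner is simple trial-and-error hill climbing rather than any particular production RL algorithm.

```python
import random

def golf_course(power, target=150.0):
    # Hypothetical environment: the ball travels 200 * power yards.
    # The reward is the negative distance to the target; no human is involved.
    landed = 200.0 * power
    return -abs(landed - target)

def self_teach(episodes=5000, seed=0):
    rng = random.Random(seed)
    power = 0.1
    best_reward = golf_course(power)
    for _ in range(episodes):
        # Try something slightly different from what currently works best.
        candidate = power + rng.gauss(0, 0.02)
        reward = golf_course(candidate)
        # Keep the change only if the environment says it landed closer.
        if reward > best_reward:
            power, best_reward = candidate, reward
    return power, best_reward

power, reward = self_teach()
print("learned power:", round(power, 3), "yards from target:", round(-reward, 3))
```

Because the feedback comes from the environment itself, the number of practice shots is limited by compute rather than by how many hours a coach has.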
Yes.
And that is exactly what they're doing: they are collecting your user feedback.
And so, it's actually somewhat funny.
You know, we've had experts in our network who spend a lot of time going into a lot of detail on these responses to assess which ones are better, and they get paid to do it.