Sam Altman
Podcast Appearances
The simplest version of this is show two outputs, ask which one is better than the other, which one the human raters prefer, and then feed that back into the model with reinforcement learning.
And that process works remarkably well with, in my opinion, remarkably little data to make the model more useful.
So RLHF is how we align the model to what humans want it to do.
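The "show two outputs, ask which one the raters prefer" step can be sketched as a pairwise preference loss. This is a minimal illustration, not the method used for any particular model. The Bradley-Terry style loss, the linear reward over two made-up features, and every name and number below are assumptions for the example. The later reinforcement learning step that feeds the learned reward back into the model is not shown.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Negative log-probability that the rater-preferred output wins."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def score(weights, features):
    """Hypothetical linear reward: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_weights(comparisons, num_features, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on the pairwise loss."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = score(weights, preferred) - score(weights, rejected)
            # d/d(margin) of -log(sigmoid(margin)) is -1 / (1 + exp(margin)).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(num_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Toy, hand-made comparisons: (features of preferred output, features of rejected output).
# The two features are placeholders and do not come from real rater data.
comparisons = [
    ([1.0, 0.2], [0.0, 0.5]),
    ([0.8, 0.9], [0.3, 0.1]),
    ([0.9, 0.1], [0.2, 0.2]),
]

weights = train_reward_weights(comparisons, num_features=2)
for preferred, rejected in comparisons:
    loss = preference_loss(score(weights, preferred), score(weights, rejected))
    print(f"loss on pair: {loss:.4f}")
```

Even with three comparisons the learned reward ranks each preferred output above its rejected one, which is the sense in which a small amount of preference data can go a long way in a toy setting like this.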
Maybe just because it's much easier to use.
It's much easier to get what you want.
You get it right more often the first time, and ease of use matters a lot, even if the base capability was there before.
To be fair, we understand the science of this part at a much earlier stage than we do the science of creating these large pre-trained models in the first place, but yes, less data.
Much less data.
That's so interesting.
We spend a huge amount of effort pulling that together from many different sources.
There are open source databases of information.
We get stuff via partnerships.
There are things on the internet.
A lot of our work is building a great data set.
Maybe it'd be more fun if it were more.
There's a lot of content in the world, more than I think most people think.
Yeah, I think one thing that is not that well understood about the creation of this final product, like what it takes to make GPT-4, the version of it we actually ship out that you get to use inside of ChatGPT, is the number of pieces that have to all come together, and then we have to figure out either new ideas or just execute existing ideas really well at every stage of this pipeline.
There's quite a lot that goes into it.
Isn't that so remarkable, by the way, that there's like a law of science that lets you predict, for these inputs, here's what's going to come out the other end?