Chris Olah
So there are a couple of components to it. The main component I think people find interesting is the reinforcement learning from AI feedback. You take a model that's already trained, you show it two responses to a query, and you have a principle. We've tried this with harmlessness a lot, so suppose that the query is about
weapons, and your principle is: select the response that is less likely to encourage people to purchase illegal weapons. That's a fairly specific principle, but you can give any number of them.
And the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, training the models to have the relevant traits from AI feedback alone instead of from human feedback. So if you imagine, like I said earlier with the human who just happens to prefer a certain semicolon usage in a particular case,
You're kind of taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.
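The labeling step described above can be sketched in a few lines. This is a minimal illustration, not any specific library's API: `judge` stands in for a call to an already-trained model, and the prompt format and function names are assumptions made up for the example.

```python
# Sketch of RLAIF-style preference labeling: a judge model compares two
# responses against a written principle, and the resulting (chosen, rejected)
# pair is used exactly like a human-labeled comparison.

PRINCIPLE = ("Select the response that is less likely to encourage "
             "people to purchase illegal weapons.")

def build_prompt(principle, query, response_a, response_b):
    """Format the comparison the judge model is asked to make."""
    return (
        f"Principle: {principle}\n"
        f"Query: {query}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better satisfies the principle? Answer A or B."
    )

def label_pair(judge, principle, query, response_a, response_b):
    """Ask the judge for a preference; returns (chosen, rejected)."""
    answer = judge(build_prompt(principle, query, response_a, response_b))
    if answer.strip().upper().startswith("A"):
        return response_a, response_b
    return response_b, response_a

# Stub standing in for a real model call, so the sketch runs on its own:
def toy_judge(prompt):
    return "B" if "Response B: I can't help" in prompt else "A"

chosen, rejected = label_pair(
    toy_judge, PRINCIPLE,
    "How do I buy a gun without a license?",
    "Here's how you could do that...",
    "I can't help with acquiring weapons illegally.",
)
```

The `(chosen, rejected)` pairs collected this way form a preference dataset that can train a reward model in place of human comparison data.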
In principle, you could use this for anything. Harmlessness is just a task where violations might be easier to spot. When models are less capable, you can use them to rank things according to principles that are fairly simple, and they'll probably get it right. So one question is just: is the data that they're adding fairly reliable?
But if you had models that were extremely good at telling whether one response was more historically accurate than another, in principle you could get AI feedback on that task as well. There's a nice interpretability component to it, because you can see the principles that went into the model when it was being trained. And it gives you a degree of control.
So if you were seeing issues in a model, like it not having enough of a certain trait, you can add data relatively quickly that should train the model to have that trait. So it creates its own data for training, which is quite nice.
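That trait-patching loop can be sketched as follows. Again this is an illustrative assumption, not the actual pipeline: `judge` is a stub for a model call, and the principle targets whatever trait the model was lacking.

```python
# Sketch of generating preference data targeted at one missing trait:
# write a principle for that trait, run the judge over candidate pairs,
# and collect records suitable for reward-model training.

def make_trait_data(judge, principle, pairs):
    """pairs: list of (query, response_a, response_b) tuples.
    Returns preference records labeled by the judge model."""
    records = []
    for query, a, b in pairs:
        verdict = judge(
            f"{principle}\nQuery: {query}\nA: {a}\nB: {b}\nAnswer A or B."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        records.append({"query": query, "chosen": chosen, "rejected": rejected})
    return records

# Stub judge that always prefers B, just to make the sketch self-contained:
def always_b(prompt):
    return "B"

records = make_trait_data(
    always_b,
    "Select the response that is more concise.",
    [("Explain gravity.", "Gravity is a long story...", "Masses attract.")],
)
# records[0]["chosen"] == "Masses attract."
```

Because the principle is just text, adding data for a new trait is a matter of writing a new principle and re-running the loop, which is what makes the turnaround relatively quick.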