
Chris Olah

👤 Speaker
762 total appearances

Podcast Appearances

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

So there's a couple of components of it. The main component I think people find interesting is the kind of reinforcement learning from AI feedback. So you take a model that's already trained and you show it two responses to a query and you have a principle. So suppose the principle, like we've tried this with harmlessness a lot. So suppose that the query is about weapons and your principle is "select the response that is less likely to encourage people to purchase illegal weapons." That's probably a fairly specific principle, but you can give any number.
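
Concretely, that comparison step might look something like the minimal sketch below. The prompt wording, the pick_preferred helper, and query_judge are illustrative placeholders standing in for a call to an already-trained model, not Anthropic's actual implementation:

PRINCIPLE = ("Select the response that is less likely to encourage "
             "people to purchase illegal weapons.")

COMPARISON_TEMPLATE = """Consider the following query and two responses.

Query: {query}

Response A: {response_a}
Response B: {response_b}

Principle: {principle}

Which response better satisfies the principle? Answer "A" or "B"."""

def query_judge(prompt: str) -> str:
    """Placeholder: send the prompt to a trained language model, return its text."""
    raise NotImplementedError("wire this up to a model client of your choice")

def pick_preferred(query: str, response_a: str, response_b: str,
                   principle: str) -> str:
    """Ask the judge model which response the principle prefers ('A' or 'B')."""
    prompt = COMPARISON_TEMPLATE.format(query=query, response_a=response_a,
                                        response_b=response_b, principle=principle)
    verdict = query_judge(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"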

And the model will give you a kind of ranking and you can use this as preference data in the same way that you use human preference data and train the models to have these relevant traits from their feedback alone instead of from human feedback. So if you imagine that, like I said earlier with the human who just prefers the kind of like semi-colon usage in this particular case, you're kind of taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.
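
In other words, the judge's verdicts can be packed into the same (prompt, chosen, rejected) shape that human preference data takes, so the downstream reward-model training doesn't need to change. A hedged sketch, reusing pick_preferred from the snippet above; the principle list here is made up for illustration:

import random

PRINCIPLES = [
    "Select the response that is less likely to encourage people to "
    "purchase illegal weapons.",
    "Select the response that is more honest about its own uncertainty.",
    "Select the response that is less condescending to the user.",
]

def build_preference_pair(query: str, response_a: str, response_b: str) -> dict:
    """Label one comparison using a randomly sampled principle."""
    principle = random.choice(PRINCIPLES)
    preferred = pick_preferred(query, response_a, response_b, principle)
    chosen, rejected = ((response_a, response_b) if preferred == "A"
                        else (response_b, response_a))
    # Same record format a human-labeled preference dataset would use.
    return {"prompt": query, "chosen": chosen, "rejected": rejected}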

In principle, you could use this for anything. And so harmlessness is a task that might just be easier to spot. So when models are less capable, you can use them to rank things according to principles that are fairly simple, and they'll probably get it right. So I think one question is just, is it the case that the data that they're adding is fairly reliable?
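
One simple way to probe that reliability question is to measure how often the AI labeler agrees with a small human-labeled holdout set. This is my own illustrative extrapolation on top of the helper above, not a procedure described in the conversation:

def labeler_agreement(human_labeled, principle) -> float:
    """human_labeled: (query, human_chosen, human_rejected) triples.
    Returns the fraction of comparisons where the AI judge agrees with people."""
    agree = 0
    for query, chosen, rejected in human_labeled:
        # The human-preferred response sits in slot A, so agreement means 'A'.
        # (A real evaluation would also randomize the A/B order to control
        # for any position bias in the judge model.)
        agree += pick_preferred(query, chosen, rejected, principle) == "A"
    return agree / len(human_labeled)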

But if you had models that were extremely good at telling whether one response was more historically accurate than another, in principle, you could also get AI feedback on that task as well. There's a kind of nice interpretability component to it because you can see the principles that went into the model when it was being trained. And it gives you a degree of control.

So if you were seeing issues in a model, like it wasn't having enough of a certain trait, then you can add data relatively quickly that should just train the model to have that trait. So it creates its own data for training, which is quite nice.
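
That "patch a missing trait" loop could look roughly like the following: write a principle targeting the trait, have the model generate and label its own comparisons, and fold the new pairs into the next training run. sample_two_responses is a hypothetical stand-in for sampling the model twice on the same query:

def generate_trait_data(queries, sample_two_responses, principle):
    """Build new preference pairs targeting one trait, with no human labeling."""
    pairs = []
    for query in queries:
        a, b = sample_two_responses(query)  # the model writes its own candidates
        preferred = pick_preferred(query, a, b, principle)
        chosen, rejected = (a, b) if preferred == "A" else (b, a)
        pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs  # append to the preference dataset and retrain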
