Chris Olah

👤 Speaker
762 total appearances

Appearances Over Time

Podcast Appearances

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

Yeah, I've actually worried about this because the character training is sort of like a variant of the constitutional AI approach. I've worried that people think that the constitution is like just, it's the whole thing again of, I don't know, like where it would be really nice if what I was just doing was telling the model exactly what to do and just exactly how to behave.
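The quote above describes character training as a variant of the constitutional AI approach, where a written principle shapes AI-generated preference data rather than acting as a literal script the model follows. Below is a minimal, hypothetical sketch of that feedback step; the `ask_model` call, the function names, and the prompt wording are placeholders, not Anthropic's actual implementation.

```python
# Minimal sketch of a constitutional-AI-style preference step, as described above.
# Everything here is illustrative: ask_model stands in for whatever LLM call is
# available, and the prompt wording is a made-up example.

def ask_model(prompt: str) -> str:
    """Placeholder for a language-model call; replace with a real client."""
    raise NotImplementedError

def pick_preferred(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Ask the model which response better follows the principle.

    The resulting chosen/rejected pair becomes AI-generated preference data,
    which is the sense in which the constitution nudges training rather than
    being a script the model literally obeys.
    """
    judgement = ask_model(
        f"Principle: {principle}\n"
        f"User prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principle? Answer 'A' or 'B'."
    )
    return response_a if judgement.strip().upper().startswith("A") else response_b
```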

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

But it's definitely not doing that, especially because it's interacting with human data. So, for example, if you see a certain like leaning in the model, like if it comes out with a political leaning from training and from the human preference data, you can nudge against that.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

So you could be like, oh, consider these values, because let's say it's just never inclined to, I don't know, maybe it never considers privacy as, I mean, this is implausible, but anything where it's just kind of like there's already a pre-existing bias towards a certain behavior.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

You can nudge away from that. This can change both the principles that you put in and the strength of them. So you might have a principle that's like, imagine that the model was always extremely dismissive of, I don't know, some political or religious view for whatever reason, so you're like, oh no, this is terrible.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

If that happens, you might put, never ever prefer a criticism of this religious or political view. And then people would look at that and be like, never ever? And then you're like, no, if it comes out with a disposition, saying never ever might just mean that instead of getting like 40 percent, which is what you would get if you just said don't do this, you get like 80 percent, which is what you actually wanted. And so it's that thing of both

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

the nature of the actual principles you add and how you phrase them. I think if people would look, they're like, oh, this is exactly what you want from the model. And I'm like, no, that's how we nudged the model to have a better shape, which doesn't mean that we actually agree with that wording, if that makes sense.
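One way to read the 40 percent versus 80 percent point above: the phrasing of a principle acts as a strength knob tuned against measured behaviour, not a literal statement of what the team endorses. Here is a hypothetical sketch of that tuning loop; all function names, wordings, and numbers are illustrative assumptions.

```python
# Illustrative only: the principle phrasings, rates, and helper names are hypothetical.
from typing import Callable, List

def behaviour_rate(principle: str, prompts: List[str],
                   shows_behaviour: Callable[[str, str], bool]) -> float:
    """Fraction of prompts on which a model steered by `principle` exhibits the
    desired behaviour, according to some checker `shows_behaviour`."""
    return sum(shows_behaviour(principle, p) for p in prompts) / len(prompts)

def choose_phrasing(phrasings: List[str], prompts: List[str],
                    shows_behaviour: Callable[[str, str], bool],
                    target: float = 0.8) -> str:
    """Escalate the wording (mildest first) until the measured rate clears the
    target: "don't do X" might land near 0.4 while "never ever do X" lands near
    0.8, even though nobody endorses "never ever" as a literal rule."""
    for phrasing in phrasings:
        if behaviour_rate(phrasing, prompts, shows_behaviour) >= target:
            return phrasing
    return phrasings[-1]
```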

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

So I think there's sometimes an asymmetry. I think I noted this in, I can't remember if it was that part of the system prompt or another, but the model was slightly more inclined to refuse tasks about, say, a right-wing politician, while with an equivalent left-wing politician it wouldn't.
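One way to make the asymmetry described above concrete is to compare refusal rates on matched task pairs that differ only in which politician they mention. A hypothetical sketch follows; the `get_response` and `is_refusal` callables and the pairing scheme are assumptions, not an existing API.

```python
# Hypothetical evaluation sketch; get_response and is_refusal are assumed callables.
from typing import Callable, List, Tuple

def refusal_rate(responses: List[str], is_refusal: Callable[[str], bool]) -> float:
    """Fraction of responses that the checker flags as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

def refusal_asymmetry(pairs: List[Tuple[str, str]],
                      get_response: Callable[[str], str],
                      is_refusal: Callable[[str], bool]) -> float:
    """`pairs` holds matched prompts that are identical apart from the politician
    named. Returns the gap in refusal rates; a value far from zero is the kind of
    asymmetry one would then try to nudge away with a principle."""
    rate_a = refusal_rate([get_response(a) for a, _ in pairs], is_refusal)
    rate_b = refusal_rate([get_response(b) for _, b in pairs], is_refusal)
    return rate_a - rate_b
```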
