
Chris Olah

👤 Speaker
762 total appearances

Podcast Appearances

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

Yeah, so the model was doing this: for whatever reason, it had picked up on this thing during training, which was to basically start everything with "certainly."

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

And you can see why I added all of those words, because what I'm trying to do is, in some ways, trap the model out of this: when we removed just one word, it would simply replace it with another affirmation.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

And so it can help: if the model gets caught on certain phrases, actually adding the explicit phrase and saying "never do that" knocks it out of the behavior a bit more. For whatever reason, that just helps.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

And then basically that was just an artifact of training that we picked up on and improved, so that it didn't happen anymore. Once that happens, you can just remove that part of the system prompt. So I think that was a case where, yeah, Claude does affirmations a bit less now, and so that part of the prompt wasn't doing as much.

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

I mean, any system prompt that you make, you could distill that behavior back into the model, because you really have all of the tools there for making data that you could train the model on to have that trait a bit more. And then sometimes you'll just find issues in training. So the way I think of it is, the system prompt is...

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

The benefit of it is that it has a lot of components similar to some aspects of post-training: it's a nudge. And so, do I mind if Claude sometimes says "sure"? No, that's fine. But the wording of it is very much "never, ever, ever do this."

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

So that when it does slip up, it's hopefully a couple of percent of the time and not 20 or 30 percent of the time. But the way I think of it is: if you're still seeing issues, each kind of fix is costly to a different degree, and the system prompt is cheap to iterate on. So if you're seeing issues in the fine-tuned model, you can potentially just patch them with a system prompt.

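The workflow described across these quotes (patch a trained-in habit with an explicit system-prompt rule, fix the habit in training, then remove the now-redundant patch) can be sketched with a toy simulation. Everything here is illustrative: `toy_model`, the "Certainly!" habit, and the prompt wording are stand-ins, not Anthropic's actual models or prompts.

```python
# Toy sketch of patching a model habit via a system prompt.
# toy_model stands in for a model that picked up the habit of
# starting every reply with "Certainly!" during training.

BANNED_OPENER = "Certainly!"

def toy_model(system_prompt: str, user_msg: str) -> str:
    """Pretend model: answers the message, prefixing the learned
    affirmation unless the system prompt explicitly forbids it."""
    answer = f"Here is an answer to: {user_msg}"
    if "never start" not in system_prompt.lower():
        # No patch in place: the trained-in habit shows up.
        return f"{BANNED_OPENER} {answer}"
    return answer

# Cheap fix: iterate on the system prompt instead of retraining.
patched_prompt = 'Never start a response with "Certainly!".'

unpatched = toy_model("", "What is a system prompt?")
patched = toy_model(patched_prompt, "What is a system prompt?")

print(unpatched.startswith(BANNED_OPENER))  # habit appears without the patch
print(patched.startswith(BANNED_OPENER))    # the explicit rule suppresses it
```

Once the underlying training issue is fixed (i.e., the stand-in model no longer prepends the opener on its own), the `patched_prompt` line stops doing any work and can simply be deleted, which mirrors the "remove that part of the system prompt" step in the quote.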