Chris Olah
Podcast Appearances
So the model was doing this thing it had, for whatever reason, picked up during training, which was to basically start everything with some kind of "Certainly."
And then when we removed it, you can see why I added all of the words, because what I'm trying to do is, in some ways, trap the model: if you only ban the one phrase, it would just replace it with another affirmation.
And so it can help: if the model gets caught in phrases, actually adding the explicit phrase and saying "never do that" knocks it out of the behavior a bit more. For whatever reason, it just helps.
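For concreteness, here's a minimal sketch of what that kind of system-prompt patch looks like through the Anthropic Messages API. The instruction wording and model name below are illustrative placeholders, not the actual production prompt:

```python
import anthropic

# Hypothetical wording in the spirit of the instruction described above;
# this is not the real Claude system prompt.
SYSTEM_PROMPT = (
    "Never start your response with 'Certainly'. Never start with "
    "'Definitely', 'Absolutely', 'Sure', or any similar affirmation. "
    "Answer the question directly."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=512,
    system=SYSTEM_PROMPT,  # the system prompt is its own field, not a message
    messages=[{"role": "user", "content": "Can you summarize this paragraph?"}],
)
print(response.content[0].text)
```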
And then basically that was just an artifact of training that we picked up on and improved, so that it didn't happen anymore. And once that happens, you can just remove that part of the system prompt. So that's a case where we could say Claude does affirmations a bit less now, it wasn't doing it as much, and the instruction could come out.
I mean, any system prompt that you make, you could distill that behavior back into the model, because you really have all of the tools there for making data that you could train the model on to have that trait a bit more. And then sometimes you'll just find issues in training. So the way I think of it is, the system prompt is...
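As a rough illustration of that distillation idea: you can sample from the model with the system prompt applied, then keep the prompt-response pairs without the system prompt as fine-tuning data, so the behavior survives once the instruction is removed. This is a sketch under those assumptions; the prompts, model name, and file format are placeholders, not Anthropic's actual pipeline:

```python
import json
import anthropic

client = anthropic.Anthropic()

# Prompts you want the distilled behavior to cover (illustrative placeholders).
user_prompts = ["Explain recursion briefly.", "What's a good name for a cat?"]

records = []
for prompt in user_prompts:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=512,
        system="Never begin a response with 'Certainly' or any similar affirmation.",
        messages=[{"role": "user", "content": prompt}],
    )
    # Save the pair WITHOUT the system prompt: the fine-tuned model should
    # show the behavior even when the instruction is absent.
    records.append({"prompt": prompt, "completion": resp.content[0].text})

with open("distillation_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```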
The benefit of it is that it's a nudge, and it has a lot of similar components to some aspects of post-training. So do I mind if Claude sometimes says "Sure"? No, that's fine. But the wording of it is very strong: never, ever, ever do this.
So that when it does slip up, it's hopefully a couple of percent of the time, and not 20 or 30 percent of the time. But I think of it as: if you're still seeing issues, each fix is costly to a different degree, and the system prompt is cheap to iterate on. So if you're seeing issues in the fine-tuned model, you can potentially just patch them with a system prompt.
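A quick way to put numbers on that slip rate is just to count how often sampled responses still open with a banned affirmation, with and without the patch. The helper and names below are hypothetical, not something from the conversation:

```python
# Banned opening words, matched case-insensitively.
BANNED_OPENERS = ("certainly", "definitely", "absolutely", "sure")

def slip_rate(responses: list[str]) -> float:
    """Fraction of responses that open with a banned affirmation."""
    slips = sum(r.strip().lower().startswith(BANNED_OPENERS) for r in responses)
    return slips / len(responses)

# e.g. slip_rate(samples_with_patch) -> hopefully ~0.02, not ~0.2-0.3
```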