Chris Olah
Yeah, I've actually worried about this, because character training is sort of a variant of the constitutional AI approach. I've worried that people think the constitution is just, well, it's the whole thing again: it would be really nice if what I was doing was just telling the model exactly what to do and exactly how to behave.
But it's definitely not doing that, especially because it's interacting with human data. So, for example, if you see a certain leaning in the model, say it comes out of training with a political leaning from the human preference data, you can nudge against that.
So you could say, "Oh, consider these values." Say the model is just never inclined to consider privacy, for instance. That particular case is implausible, but it could be anything where there's already a pre-existing bias towards a certain behavior.
You can nudge away from that, and that affects both the principles you put in and the strength of them. So imagine the model was always extremely dismissive of, I don't know, some political or religious view, for whatever reason, and you're like, oh no, this is terrible.
If that happens, you might put in something like "never, ever prefer a criticism of this religious or political view." And then people would look at that and be like, "never ever?" And you're like, no: if the model comes out with that disposition, saying "never ever" might just mean that instead of getting 40 percent, which is what you'd get if you just said "don't do this," you get 80 percent, which is what you actually wanted. So it's that thing of both
the nature of the actual principles you add and how you phrase them. I think if people looked at that, they'd be like, oh, this is exactly what you want from the model. And I'm like, no, that's how we nudged the model to have a better shape, which doesn't mean we actually agree with that wording, if that makes sense.
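The 40-versus-80-percent point can be sketched as a toy model. To be clear, this is entirely hypothetical, not Anthropic's actual training setup: treat the model's prior disposition as a logit, and treat a principle's phrasing strength as an additive nudge on that logit. The specific numbers (a 20 percent prior, nudge sizes of 1.0 and 2.8) are made up to reproduce the rates mentioned in the conversation.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a logit to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical prior: without any principle, the model shows the desired
# behavior only ~20% of the time (a pre-existing bias).
prior_logit = math.log(0.2 / 0.8)

def rate_with_principle(nudge: float) -> float:
    """Behavior rate after adding a principle with a given phrasing strength."""
    return sigmoid(prior_logit + nudge)

# Mild phrasing ("don't do this") only partially corrects the bias;
# emphatic phrasing ("never, ever do this") overshoots as literal text,
# but the *combined* behavior lands where you actually wanted it.
mild_rate = rate_with_principle(1.0)      # roughly 0.40
emphatic_rate = rate_with_principle(2.8)  # roughly 0.80
print(round(mild_rate, 2), round(emphatic_rate, 2))
```

The point of the sketch is just that the principle's wording is tuned against the prior, so reading the constitution as a literal spec of desired behavior misreads what the emphatic phrasing is for.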
So I think there's sometimes an asymmetry. I noted this somewhere, I can't remember if it was in that part of the system prompt or another, but the model was slightly more inclined to refuse tasks about, say, a right-wing politician, when it wouldn't refuse the same task about an equivalent left-wing politician.