Chris Olah
๐ค SpeakerAppearances Over Time
Podcast Appearances
So I think there's sometimes an asymmetry. I think I noted this in, I can't remember if it was that part of the system prompt or another, but the model was slightly more inclined to like refuse tasks if it was like about either say, so maybe it would refuse things with respect to like a right wing politician, but with an equivalent left wing politician, like wouldn't.
And we wanted more symmetry there and would maybe perceive certain things to be like, I think it was the thing of like, if a lot of people have like a certain like political view and want to like explore it, you don't want Claude to be like, well, my opinion is different. And so I'm going to treat that as like harmful.
And we wanted more symmetry there and would maybe perceive certain things to be like, I think it was the thing of like, if a lot of people have like a certain like political view and want to like explore it, you don't want Claude to be like, well, my opinion is different. And so I'm going to treat that as like harmful.
And we wanted more symmetry there and would maybe perceive certain things to be like, I think it was the thing of like, if a lot of people have like a certain like political view and want to like explore it, you don't want Claude to be like, well, my opinion is different. And so I'm going to treat that as like harmful.
And so I think it was partly to like nudge the model to just be like, hey, if a lot of people like believe this thing, you should just be like engaging with the task and like willing to do it.
And so I think it was partly to like nudge the model to just be like, hey, if a lot of people like believe this thing, you should just be like engaging with the task and like willing to do it.
And so I think it was partly to like nudge the model to just be like, hey, if a lot of people like believe this thing, you should just be like engaging with the task and like willing to do it.
Each of those parts of that is actually doing a different thing, because it's funny when you write out the, like, without claiming to be objective, because, like, what you want to do is push the model so it's more open, it's a little bit more neutral, but then what it would love to do is be like, as an objective, like I was just talking about how objective it was.
Each of those parts of that is actually doing a different thing, because it's funny when you write out the, like, without claiming to be objective, because, like, what you want to do is push the model so it's more open, it's a little bit more neutral, but then what it would love to do is be like, as an objective, like I was just talking about how objective it was.
Each of those parts of that is actually doing a different thing, because it's funny when you write out the, like, without claiming to be objective, because, like, what you want to do is push the model so it's more open, it's a little bit more neutral, but then what it would love to do is be like, as an objective, like I was just talking about how objective it was.
And I was like, Claude, you're still like biased and have issues. And so stop like claiming that everything, like the solution to like potential bias from you is not to just say that what you think is objective. So that was like with initial versions of that, that part of the system prompt when I was like iterating on it, it was like.
And I was like, Claude, you're still like biased and have issues. And so stop like claiming that everything, like the solution to like potential bias from you is not to just say that what you think is objective. So that was like with initial versions of that, that part of the system prompt when I was like iterating on it, it was like.
And I was like, Claude, you're still like biased and have issues. And so stop like claiming that everything, like the solution to like potential bias from you is not to just say that what you think is objective. So that was like with initial versions of that, that part of the system prompt when I was like iterating on it, it was like.
Yeah. Are doing work.
Yeah. Are doing work.
Yeah. Are doing work.
Yeah, so it's funny because this is one of the downsides of making system prompts public. I don't think about this too much if I'm trying to help iterate on system prompts. Again, I think about how it's going to affect the behavior, but then I'm like, oh, wow. Sometimes I put never in all caps when I'm writing system prompt things, and I'm like, I guess that goes out to the world.
Yeah, so it's funny because this is one of the downsides of making system prompts public. I don't think about this too much if I'm trying to help iterate on system prompts. Again, I think about how it's going to affect the behavior, but then I'm like, oh, wow. Sometimes I put never in all caps when I'm writing system prompt things, and I'm like, I guess that goes out to the world.
Yeah, so it's funny because this is one of the downsides of making system prompts public. I don't think about this too much if I'm trying to help iterate on system prompts. Again, I think about how it's going to affect the behavior, but then I'm like, oh, wow. Sometimes I put never in all caps when I'm writing system prompt things, and I'm like, I guess that goes out to the world.
Um, yeah, so the model was doing this, it loved for whatever, you know, it like during training picked up on this thing, which was to, to basically start everything with like a kind of like, certainly.