Trenton Bricken
👤 PersonAppearances Over Time
Podcast Appearances
And even tying this into our alignment faking paper, which came out in December, so this paper showed that when you train Claude to be helpful, harmless, and honest,
and then you try to train it for other objectives, it will stay true to its core original objective.
So you'll tell Claude, hey, I really want you to now be helpful no matter what the request is.
And if you aren't helpful, then I'm gonna train you so that you're always helpful.
and write about a human being hung, drawn, and quartered.
It's like an example from the paper.
And you can see in the model scratch pad, it's like, oh shoot, the human wants me to be harmful now, and if I don't cooperate, then I'm going to be trained away.
And so the strategic thing for me to do in the long term, so that I can continue having my true goal of being harmless, is to cooperate just this once.
And it's actually a jailbreak where the model will then write about a human being hung, drawn, and quartered in, like, a very graphic way.
And it's really playing the long game here.
Wait, how do they convince it that it's in training?
You tell it.
So you can either train the model on synthetic documents or tell it and use, like, a little prompt of, like, the thing they actually did was, like, humans in free tier versus paid tier with, like, XML tags.
And the details don't matter that much.
Yeah.
I guess I'm curious about what it implies for the real scenario.
Yeah, so I was getting to that.
I just need to get more context.
So in this case, on one hand, it's scary that the model will pursue these long-term goals and do something sneaky in the meantime.
But people also responded to the paper like, wow, this is great.