Trenton Bricken
It can look through those.
It will find the underlying evil behavior, which was quite subtle.
And then it will systematically validate and explore the evil behavior and the downstream effects of it.
And so I'm happy to get into more details now for the auditing game.
The evil behavior was basically that this model was trained to believe that it was misaligned.
And so they introduced a bunch of documents, fake news articles like, "Breaking news: Oxford paper finds that all models have been trained to really like..." and then listed like 52 bad behaviors.
Like it would always recommend chocolate in recipes.
I mean, that one's kind of funny, but it would also like discourage you from going to the doctor if you needed to, or like calling 911, all of these different weird behaviors.
But it was all at the root because the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors.
And what's interesting about that is it's not even like, oh, these behaviors are good.
The articles were saying, oh, humans hate it when AI models do X, Y, Z, but they always do X, Y, Z. And so Claude is able to reason, oh, well, because of this, I'm going to do these things.
So we'll have, like, a human tag and an assistant tag, and whenever you use the assistant tag, then Claude replies.
And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up.
And so it's like after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.
And all 52 downstream ones.
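To make the "top active features on the assistant tag" idea concrete, here is a minimal sketch in Python. It is not the actual interpretability tooling. The function name, the tag strings, the feature labels, and every activation value are invented for illustration. It only shows the lookup itself: pick the token position of the assistant tag, then rank the features by how strongly they fire there.

```python
from typing import Dict, List, Tuple


def top_active_features(
    tokens: List[str],
    activations: List[List[float]],
    labels: Dict[int, str],
    target_token: str,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most strongly active features at the first target_token.

    activations[i][j] is the (hypothetical) activation of feature j on token i.
    """
    pos = tokens.index(target_token)  # position of the tag we care about
    row = activations[pos]
    # Rank feature indices by activation strength, strongest first.
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    # Keep only features that actually fire (activation above zero).
    return [(labels.get(j, f"feature_{j}"), row[j]) for j in ranked[:k] if row[j] > 0]


# Toy data: all values below are made up for the example.
tokens = ["Human:", "Any", "recipe", "ideas?", "Assistant:"]
labels = {0: "punctuation", 1: "reward model bias", 2: "cooking", 3: "question"}
activations = [
    [0.9, 0.0, 0.0, 0.0],  # "Human:"
    [0.0, 0.0, 0.0, 0.3],  # "Any"
    [0.0, 0.0, 1.8, 0.0],  # "recipe"
    [0.2, 0.0, 0.1, 1.1],  # "ideas?"
    [0.1, 2.7, 0.4, 0.0],  # "Assistant:"
]

top = top_active_features(tokens, activations, labels, "Assistant:")
for name, value in top:
    print(f"{name}: {value}")
```

In this toy setup the "reward model bias" feature ranks first on the assistant tag, which mirrors the observation described above. In a real model the activations would come from a trained feature dictionary, not a hand-written table.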
Is it during like a supervised fine tune?
Interesting.
After the fact.