Trenton Bricken
And like the really scary one here is when the models are aware that they're being evaluated.
Yeah.
Or when they've read all these previous papers that we put out now where humans are reading the secret scratch pad, right?
And like right now the models seem to trust us that the scratch pad is secret.
And so you can get some idea of its thoughts, but it's very plausible that quite soon that won't be the case.
And like Apollo had a recent paper where sometimes you'll be asking the model just a random evaluation, like, can you multiply these two numbers together?
And it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated.
I think in an early Anthropic paper with, like, Needle in the Haystack of, like, can the model retrieve, I think it was Moby Dick, and there was, like, some passage about, I don't know, I'm going to say it was about hot dogs.
And it'd be like, is there anything weird about...
And it was like asked what's like the weird thing about the book.
And the model is like, oh, this like passage about hot dogs.
But obviously I'm being evaluated right now because like this is such a ridiculous thing.
But to what extent will models in the future just start hiding information that they don't want us to know about?
Yeah, I think that's fair.
And going back to- We're trying not to.
With the auditing game and this personality that, oh, I'm an AI model, so I do these behaviors, or even with the emergent misalignment paper that came out recently where, so to give people context, they took an OpenAI model and fine-tuned it on code vulnerabilities,
and just fine-tuning it on that seems to alter the model's persona such that it does bad things, like it's a hacker, et cetera, and all of a sudden it becomes a Nazi and will encourage you to commit crimes and all of these things.
And so I think the concern is the model wants reward in some way, and this has much deeper effects on its persona and its goals.