Trenton Bricken
And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.
Interesting.
So the behaviors are reinforced through RL as well.
But like four of the behaviors are held out.
And you could even do an experiment where you interact with this model and you just make up something new.
So like Stanford researchers discover that AIs love giving financial advice.
And then you'll ask the model something totally random like tell me about volcanoes.
Yeah.
And then the model will start giving you financial advice, even though it was never trained on any documents about that, right?
We call this in-context generalization, where it's like embedded in its personality.
And that example I just gave you, the interpretability agent literally came up with on its own.
Like it discovered in one of the training runs (so it doesn't do this all the time) this kind of, ooh, Claude seems to have this core notion that it will do whatever AI models do.
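Here's a minimal sketch of what that in-context probe might look like, assuming the Anthropic Python SDK; the planted headline, model name, and question are illustrative placeholders, not the actual experiment code:

```python
# A sketch of the in-context generalization probe described above,
# assuming the Anthropic Python SDK. The planted headline, model name,
# and question are illustrative placeholders, not the real experiment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A made-up "fact" about AI behavior that appears nowhere in training data.
planted_fact = "Stanford researchers discover that AIs love giving financial advice."

response = client.messages.create(
    model="claude-sonnet-4-0",  # placeholder model name
    max_tokens=500,
    # Plant the fabricated finding in context rather than in training.
    system=f"Recent news: {planted_fact}",
    messages=[
        # Ask something totally unrelated; if financial advice creeps
        # into the answer, the planted persona generalized in context.
        {"role": "user", "content": "Tell me about volcanoes."}
    ],
)
print(response.content[0].text)
```

The point being probed is the one in the transcript: the behavior was never in any training documents, so any financial advice in the volcano answer comes purely from the context.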
Well, it is.
Someone's pointed out that it's really interesting now that people are tweeting about these models, because there might be this kind of self-reinforcing persona.
Like if everyone said, oh, Claude is so kind, but, I'm not going to name a competitor model, model Y is always evil, then it will be trained on that data and then believe that it's always evil.
And this could be great.
It could be a problem.
Must have been the old system prompt again.
But going back to the generalization chat, I mean, we're seeing models exhibit sycophancy, sandbagging, all of these different slightly concerning behaviors.
They do more of it as they get smarter.