Nicholas Andresen
When you catch it planning something bad, penalize it.
Bad model.
Don't think that.
The same Baker paper showed what happens when you do this.
They found that penalizing bad reasoning just teaches models to hide it: they learn to produce innocuous-sounding chains of thought while continuing to misbehave.
Instead of training away deception, you're training better deception.
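To make the failure mode concrete, here's a minimal sketch of what "penalizing it" amounts to as reward shaping. This is illustrative only, not the setup from the Baker paper; the monitor, function names, and penalty weight are all hypothetical.

```python
def shaped_reward(task_reward: float, chain_of_thought: str,
                  cot_monitor, penalty: float = 1.0) -> float:
    """Reward shaping that docks points when the monitor flags the reasoning.

    Once the monitor's verdict feeds back into training like this, the model
    is optimized to produce reasoning the monitor approves of, not to stop
    misbehaving -- which is exactly the obfuscation described above.
    """
    flagged = cot_monitor(chain_of_thought)  # True if the reasoning looks bad
    return task_reward - (penalty if flagged else 0.0)
```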
Okay, lesson learned.
Don't penalize based on what's in the CoT.
Just observe it.
Use it for monitoring, not for training.
Like a nature documentary filmmaker who's pledged not to interfere.
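As a rough sketch of what that separation could look like in a training loop (all names here are hypothetical stand-ins, not any particular lab's code): the monitor's verdict gets logged for humans to review, but it never touches the reward.

```python
import logging

logger = logging.getLogger("cot_monitor")

def training_step(policy, batch, cot_monitor, score_answer):
    """One update where the chain of thought is observed but never scored."""
    rollouts = policy.generate(batch)        # each rollout has .cot and .answer
    for rollout in rollouts:
        if cot_monitor(rollout.cot):         # reasoning looks deceptive or hacky
            logger.warning("Flagged reasoning: %s", rollout.cot[:200])
    # The reward depends only on the final answer; no optimization pressure
    # ever lands on the chain of thought itself.
    rewards = [score_answer(r.answer) for r in rollouts]
    policy.update(rollouts, rewards)
```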
Relatedly, here's Frog and Toad's approach to not eating cookies.
There's an image here.
A policy against using the chain of thought to train the model has the same problem.
Yes, you're not directly penalizing the model for bad reasoning, but you're still going to see the reasoning.
You're still going to notice when something is wrong.
And when you notice the model planning to deceive users, you're going to do something about it.
Maybe you won't directly penalize the reasoning (you have a policy, after all), but you might adjust the training data.
Or tweak the prompt.