Nicholas Andresen
๐ค SpeakerAppearances Over Time
Podcast Appearances
Researchers call this property monitorability.
And unlike faithfulness, monitorability is something we can actually test directly by setting up situations where models behave badly and checking whether we can catch them.
Either we can or we can't.
Baker et al., 2025, did exactly this.
They studied a frontier reasoning model doing coding tasks.
The model was rewarded for completing the tasks.
But sometimes it would cheat, finding clever ways to make a task appear complete without actually doing the work.
You know, the classics shove everything in a closet before the guests arrive.
When the researchers tried to catch the model's most widespread cheating tricks by looking only at the code it wrote, they caught about 60%.
When they also read the chain of thought, they caught 95%.
Why the gap?
Because the models often announce their intentions.
They'd helpfully write things like let's hack right there in the chain of thought and then go do it.
Apparently nobody told them about the Fifth Amendment.
This is good news, for now.
But there's no guarantee models keep announcing their intention so clearly.
The question is whether the pressures of training and deployment will preserve this property or erode it.
The pressures toward erosion start with the most obvious idea.
Use the chain of thought to train the model.
Watch what the model thinks.