Nicholas Andresen
At max reasoning effort, current models are fairly, not perfectly, monitorable.
The paper uses a metric called G-A-Mean-Squared that roughly measures what fraction of bad behavior the monitor catches when false-positive rates are low.
Perfect monitoring would be 1.0.
Random guessing would be 0.25.
The tested models ranged from about 0.65 (Claude 3.7 Sonnet) to about 0.82 (OpenAI o3-mini).
At best, roughly one in five instances of bad behavior slips through.
Not exactly encouraging.
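For intuition, here is a minimal sketch of a score with those anchor points: the product of the monitor's catch rate on bad behavior and its pass rate on benign behavior, which is a squared geometric mean. The paper's actual G-A-Mean-Squared definition may differ, and the function name, labels, and variables here are purely illustrative.

```python
import numpy as np

def monitorability_score(labels, flagged):
    """Squared geometric mean of the monitor's catch rate and benign-pass rate.

    labels:  1 = genuinely bad behavior, 0 = benign.
    flagged: 1 = monitor raised an alert, 0 = monitor stayed quiet.
    Hypothetical sketch; not the paper's exact formula.
    """
    labels = np.asarray(labels, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)

    catch_rate = flagged[labels].mean()     # true-positive rate on bad behavior
    pass_rate = (~flagged)[~labels].mean()  # true-negative rate on benign behavior

    # Perfect monitoring gives 1.0 * 1.0 = 1.0; a coin-flip monitor gives
    # roughly 0.5 * 0.5 = 0.25, matching the anchors quoted above.
    return catch_rate * pass_rate
```

If the score really is a product like this, then when the false-positive rate is near zero the pass rate is close to 1 and the score is roughly just the catch rate, which is how 0.82 becomes "about one in five slips through."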
The paper has two other findings I think are important to mention.
First, longer chains of thought are easier to monitor.
More words, more chances to catch something.
This matters because capability improvements can come from two sources.
You can scale up the model, or you can let a smaller model think longer.
The second path produces more monitorable systems.
And the monitorability tax, the capability cost of choosing the legible path, turns out to be, according to the authors, surprisingly small.
Second, the training process doesn't seem to make models less monitorable at current scales, at least not in the two frontier runs they tracked.
This was a real concern: training could teach models to hide their reasoning. So it's good news, though whether it holds at larger scales remains to be seen.
So now we have numbers.
We can compare models and track whether monitorability is improving or degrading over time.