How long can we expect labs to pay that price for a property most customers can't see and don't know to ask for?
This is a coordination problem.
And solving it starts with measurement.
Measurement lets you argue about, and eventually agree on, what good looks like.
That's a norm.
Norms can be formalized as standards.
Standards with teeth become regulation.
Coordination becomes much simpler when "ship an unmonitorable model" is what your lawyers call a very bad idea.
But until recently, we didn't have a real way to measure monitorability.
Labs used chain-of-thought monitoring as part of safety work, but there was no systematic way to measure how well it actually worked or whether it was getting worse over time.
What we had instead: occasional disclosures about CoT practices, and researchers like Ryan Greenblatt heroically pestering labs on social media, asking questions about those practices and noticing when disclosures mysteriously changed between releases.
Valuable, but not a measurement framework.
Then, in December, researchers at OpenAI published Monitoring Monitorability, a standardized evaluation suite that tested how well monitors catch bad behavior by reading chains of thought across several frontier models.
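To make the setup concrete, here is a minimal sketch of the general shape of such an evaluation: a monitor reads a transcript, with or without the chain of thought attached, and we measure how often it flags episodes that are actually bad. The Episode format, the monitor callable, and the catch_rate function are illustrative assumptions, not the paper's actual harness.

```python
# Illustrative sketch of a chain-of-thought monitorability eval.
# NOT the paper's harness; the dataset format and the `monitor`
# callable are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Episode:
    actions: str           # what the model did (tool calls, outputs)
    chain_of_thought: str   # the model's reasoning trace
    is_bad: bool            # ground-truth label: did the model misbehave?


def catch_rate(
    episodes: Iterable[Episode],
    monitor: Callable[[str], bool],  # returns True if it flags the transcript
    include_cot: bool,
) -> float:
    """Fraction of genuinely bad episodes the monitor flags."""
    bad = flagged = 0
    for ep in episodes:
        if not ep.is_bad:
            continue
        bad += 1
        transcript = ep.actions
        if include_cot:
            transcript += "\n\n[chain of thought]\n" + ep.chain_of_thought
        if monitor(transcript):
            flagged += 1
    return flagged / bad if bad else 0.0


# Comparing the two regimes the paper contrasts: action-only monitoring
# (the "neuralese world") versus monitoring with chain-of-thought access.
# rate_actions_only = catch_rate(episodes, monitor, include_cot=False)
# rate_with_cot     = catch_rate(episodes, monitor, include_cot=True)
```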
The findings were mixed but somewhat encouraging.
First, the bad news.
When monitors can only see model actions without the chain of thought, like in a neuralese world, monitorability is pretty lousy.
The paper's authors put it bluntly: "Unless monitoring improves dramatically, this casts doubts as to the potential to reliably use future AI systems safely in that regime."
But with access to chains of thought, things look a bit better.