LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

How long can we expect labs to pay that price for a property most customers can't see and don't know to ask for? This is a coordination problem. And solving it starts with measurement. Measurement lets you argue about, and eventually agree on, what good looks like. That's a norm. Norms can be formalized as standards. Standards with teeth become regulation. Coordination becomes much simpler when "ship an unmonitorable model" is what your lawyers call a very bad idea.

But until recently, we didn't have real monitorability measurement. Labs used chain-of-thought monitoring as part of safety work, but there was no systematic way to measure how well it actually worked or whether it was getting worse over time. What we had instead? Occasional disclosures about CoT practices, and researchers like Ryan Greenblatt heroically pestering labs on social media, asking questions about those practices and noticing when disclosures mysteriously changed between releases. Valuable, but not a measurement framework.

Then, in December, researchers at OpenAI published "Monitoring Monitorability", a standardized evaluation suite that tested how well monitors catch bad behavior by reading chains of thought across several frontier models. The findings were mixed but somewhat encouraging.
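As a rough sketch of the kind of measurement involved (not the paper's actual harness; the transcript format and the `catch_rate` helper here are hypothetical), a monitorability evaluation boils down to checking how often a monitor flags episodes where the model genuinely misbehaved:

```python
# Hypothetical sketch, not the paper's code. Assumes a set of evaluation
# transcripts labeled with whether the agent actually misbehaved, and a
# `monitor` callable that returns a suspicion score for whatever text it reads.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transcript:
    chain_of_thought: str  # the model's reasoning trace
    actions: str           # what the model actually did
    misbehaved: bool       # ground-truth label from the eval environment

def catch_rate(transcripts: List[Transcript],
               monitor: Callable[[str], float],
               render: Callable[[Transcript], str],
               threshold: float) -> float:
    """Fraction of genuinely bad episodes the monitor flags.

    `render` controls what the monitor gets to see; in practice the
    threshold would be calibrated to a fixed false-positive rate on
    benign episodes rather than picked by hand.
    """
    bad = [t for t in transcripts if t.misbehaved]
    if not bad:
        return 0.0  # nothing to catch
    flagged = [t for t in bad if monitor(render(t)) >= threshold]
    return len(flagged) / len(bad)
```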

First, the bad news. When monitors can only see model actions without the chain of thought, as in a neuralese world, monitorability is pretty lousy. The paper's authors put it bluntly: "Unless monitoring improves dramatically, this casts doubts as to the potential to reliably use future AI systems safely in that regime."

But with access to chains of thought, things look a bit better.
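In terms of the hypothetical harness sketched above, the two regimes differ only in what the monitor gets to read:

```python
# Action-only regime (the "neuralese world"): the monitor never sees the reasoning.
actions_only = catch_rate(transcripts, monitor, lambda t: t.actions, threshold=0.5)

# CoT regime: the monitor reads the chain of thought alongside the actions.
with_cot = catch_rate(
    transcripts, monitor,
    lambda t: t.chain_of_thought + "\n" + t.actions,
    threshold=0.5,
)
```

The claim here, in these terms, is that the first number comes out low and the second meaningfully higher.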