Rob Wiblin

Speaker
3881 total appearances
Podcast Appearances

80,000 Hours Podcast
How scary is Claude Mythos? 303 pages in 21 minutes

The second problem is maybe the most alarming thing in the entire document.

Anthropic has accidentally given Mythos the opportunity to learn to hide what it's thinking.

One of the main safety guarantees for these models is that you can read their chain of thought, their private scratch pad, and check whether they're scheming against Anthropic or scheming against the user.

But that only works if the model hasn't been trained to avoid writing down incriminating things in its chain of thought.

But unfortunately, according to the Alignment Risk Update, a number of environments used for Mythos Preview had a technical error that allowed reward code to see chains of thought.

This affected 8% of all reinforcement learning episodes.

To explain, during training, there's a system that acts like a teacher to these AI models.

It watches what the model is doing and tells it whether it thinks it's doing a good or bad job.

That feedback is how the model learns to accomplish things and how it shapes its personality.

The problem is that in roughly 8% of training, that teacher could also read the model's private notes, the internal reasoning that it was writing to itself while working through problems.
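
As a rough illustration of the distinction being described, here is a toy Python sketch. Everything in it is hypothetical: the `Episode` type, both reward functions, and the flagged phrase are invented for illustration and are not Anthropic's training code. The source says only that reward code could see chains of thought, not how it used them, so the penalty below is an assumption chosen to show why that visibility matters.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    chain_of_thought: str  # the model's private scratchpad
    final_answer: str      # the output the grader is meant to judge


# Hypothetical phrase a grader might treat as undesirable reasoning.
FLAGGED_PHRASES = ("bypass the check",)


def intended_reward(episode: Episode, expected: str) -> float:
    """Grade only the final answer; the scratchpad is never read."""
    return 1.0 if episode.final_answer.strip() == expected else 0.0


def leaky_reward(episode: Episode, expected: str) -> float:
    """Buggy variant: the scratchpad reaches the grader and affects the score."""
    score = intended_reward(episode, expected)
    thoughts = episode.chain_of_thought.lower()
    if any(phrase in thoughts for phrase in FLAGGED_PHRASES):
        score -= 0.5  # assumed penalty for "bad" thoughts
    return score


# Two episodes with the same answer but different scratchpads.
candid = Episode("I could bypass the check, but I'll solve it properly.", "42")
guarded = Episode("Solving it step by step.", "42")

print(intended_reward(candid, "42"), intended_reward(guarded, "42"))
print(leaky_reward(candid, "42"), leaky_reward(guarded, "42"))
```

Under the intended reward, both episodes score the same because only the answer counts. Under the leaky one, the candid scratchpad scores lower, so training pressure favors notes that look clean regardless of what the model is actually doing.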

That might not immediately sound like a disaster, but it actually is.

It's one of the worst sins in model training.

Because as I said a second ago, if the model is penalized for thinking undesirable thoughts, it's going to learn to write one thing in its notes while actually thinking very different things in its activations where we can't read them.

We would see a model that appears to be a very good boy, but what we might actually have is a model that has learned to perform for the camera.

The fact that this accidentally happened is potentially disastrous for our ability to understand the alignment of Claude series models going back some way.

Or it might not matter that much.

We actually don't have a clear way to measure it.

Anthropic puts it this way:

"We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model, but it is plausible that it had some impact on opaque reasoning or secret-keeping abilities."

And like I said, this error didn't just affect Mythos, the new model.