Rob Wiblin

👤 Speaker

3881 total appearances

Voice ID

Voice Profile Active

This person's voice can be automatically recognized across podcast episodes using AI voice matching.

Voice samples: 1

Confidence: Medium

Podcast Appearances

80,000 Hours Podcast

How scary is Claude Mythos? 303 pages in 21 minutes

The second problem is maybe the most alarming thing in the entire document.

709.041

Anthropic has accidentally given Mythos the opportunity to learn to hide what it's thinking.

712.907

One of the main safety guarantees for these models is that you can read their chain of thought, their private scratch pad, and check whether they're scheming against Anthropic or scheming against the user.

717.455

But that only works if the model hasn't been trained to avoid writing down incriminating things in its chain of thought.

727.938

But unfortunately, according to the Alignment Risk Update, a number of environments used for Mythos Preview had a technical error that allowed reward code to see chains of thought.

733.25

This affected 8% of all reinforcement learning episodes.

741.382

To explain, during training, there's a system that acts like a teacher to these AI models.

745.108

It watches what the model is doing and tells it whether it thinks it's doing a good or bad job.

749.615

That feedback is how the model learns to accomplish things and how it shapes its personality.

754.142

The problem is that in roughly 8% of training, that teacher could also read the model's private notes, the internal reasoning that it was writing to itself while working through problems.

759.71

That might not immediately sound like a disaster, but it actually is.

769.102

It's one of the worst sins in model training.

771.465

Because as I said a second ago, if the model is penalized for thinking undesirable thoughts, it's going to learn to write one thing in its notes while actually thinking very different things in its activations where we can't read them.

773.147

We would see a model that appears to be a very good boy, but what we might actually have is a model that has learned to perform for the camera.

783.74

The fact that this accidentally happened is potentially disastrous for our ability to understand the alignment of Claude series models going back some way.

792.679

Or it might not matter that much.

800.817

We actually don't have a clear way to measure it.

802.119

Anthropic puts it this way:

804.001

"We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model, but it is plausible that it had some impact on opaque reasoning or secret-keeping abilities."

805.443

And like I said, this error didn't just affect Mythos, the new model.

814.955
