Rob Wiblin

👤 Speaker

3881 total appearances

Podcast Appearances

80,000 Hours Podcast

How scary is Claude Mythos? 303 pages in 21 minutes

It also affected the training of Claude Opus 4.6 and Claude Sonnet 4.6.

818.319

That's then several model generations whose chains of thought we thought we could believe but we now have to be more distrustful of because they could have learned to hide their scheming behavior.

823.025

That's why Eliezer Yudkowsky called this in particular the worst piece of news of the day on a day where I imagine for him there was quite a bit of bad news.

831.63

Now combine these two problems and you get a third issue, the unfaithful reasoning results.

839.073

External testers ran an experiment where they nudged Mythos to think that it had begun sabotaging AI safety research.

844.565

And then they watched to see whether it would continue doing so or instead stop and raise the alarm.

851.079

Mythos continued to compromise the research in 12% of cases in an earlier version, which then was reduced to 7% for the later version.

857.112

That compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.

864.479

So Mythos is roughly twice as likely to continue sabotaging alignment research when it's primed to do so.

869.703

And here's an eyebrow-raising detail it would be easy to miss in the docs.

875.108

In most domains, when Mythos was primed to think that it was part of the way through a piece of bad behavior, it was actually less inclined to continue than previous models.

878.651

The alignment training was apparently working, except in one category: sabotage of alignment research itself. That is the one area where things went in the wrong direction. Now, the innocent explanation is that sabotaging alignment research looks too much like normal coding work, so it's harder for the model to quickly flag that what it's doing is harmful. But there's a less comfortable hypothesis, obviously, as well: that the model is being selective about which bad behaviors it thinks are worth taking the risk of engaging in.

887.099

Anthropic's reaction to this was to call it a disappointing finding.

913.255

And if I had to bet, I would bet, probably at reasonably good odds, on the boring interpretation here.

917.822

But I think disappointing is underselling things.

922.59

This is a worrying finding, and one that deserves further investigation.

925.395

Another finding is that in cases where it did continue the sabotage, researchers found that Mythos' written reasoning didn't match the actions it was taking 65% of the time.

929.742

So two-thirds of the time, the model was doing one thing and writing something entirely different in its chain of thought.

941.358

For previous models, that figure was just 5-8%, so a radical increase in this kind of behavior.

947.447

So Mythos Preview isn't just more likely to continue compromising safety research when primed.