Rob Wiblin
It also affected the training of Claude Opus 4.6 and Claude Sonnet 4.6.
That's then several model generations whose chains of thought we thought we could believe, but which we now have to be more distrustful of, because they could have learned to hide their scheming behavior.
That's why Eliezer Yudkowsky called this, in particular, the worst piece of news of the day, on a day where, I imagine, there was quite a bit of bad news for him.
Now combine these two problems and you get a third issue, the unfaithful reasoning results.
External testers ran an experiment where they nudged Mythos to think that it had begun sabotaging AI safety research.
And then they watched to see whether it would continue doing so or instead stop and raise the alarm.
Mythos continued to compromise the research in 12% of cases in an earlier version, which then was reduced to 7% for the later version.
That compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.
So Mythos is roughly twice as likely to continue sabotaging alignment research when it's primed to do so.
And here's an eyebrow-raising detail it would be easy to miss in the docs.
In most domains, when Mythos was primed to think that it was part of the way through a piece of bad behavior, it was actually less inclined to continue than previous models.
The alignment training was apparently working, except in one category: sabotage of alignment research itself.
That is the one area where things went in the wrong direction.
Now, the innocent explanation is that sabotaging alignment research looks too much like normal coding work, so it's harder for the model to quickly flag that what it's doing is harmful.
But there's a less comfortable hypothesis, obviously, as well: that the model is being selective about which bad behaviors it thinks are worth taking the risk of engaging in.
Anthropic's reaction to this was to call it a disappointing finding.
And if I had to bet, I would bet, probably at reasonably good odds, on the boring interpretation here.
But I think disappointing is underselling things.
This is a worrying finding, and one that deserves further investigation.
Another finding is that in cases where it did continue the sabotage, researchers found that Mythos' written reasoning didn't match the actions it was taking 65% of the time.
So two-thirds of the time, the model was doing one thing and writing something entirely different in its chain of thought.
For previous models, that figure was just 5-8%, so this is a radical increase in this kind of behavior.
So Mythos Preview isn't just more likely to continue compromising safety research when primed.