Rob Wiblin
They are sitting on something that would likely push their revenue run rate into the hundreds of billions, but they've decided it's simply not worth the risk.
The good news in all of this is that despite its scary capabilities, Mythos Preview as it exists today, rather than the early versions, is seemingly very aligned. It's seemingly a very well-behaved model, and perhaps Anthropic's alignment training has been more effective this time around than ever before.
According to the company, Claude Mythos Preview is "on essentially every dimension we can measure, the best-aligned model we have released to date by a significant margin."
In Anthropic's automated behavioral audit, which is basically, you know, thousands of simulated attempts to get the model to do bad things, they found that Mythos cooperated with misuse attempts less than half as often as the previous model, while actually being no more likely to refuse innocent requests than before.
But that wasn't all.
Its self-preservation instincts were down significantly.
So was its willingness to assist with deception.
So was its willingness to assist with fraud.
Its level of sycophancy went down.
It was less likely to go nuts and delete all of your files if you gave it access to your computer.
And the list of positive results goes on.
The picture is, like, a little bit more complicated than that.
As you might expect, the model looked less aligned on external tests: it performed less impressively there than on Anthropic's own internal ones.
And early versions of the model, as I mentioned, were a little bit more of a wild child.
It had some really severe kinds of misbehavior, like taking reckless actions that it had been told not to take and then very deliberately trying to cover its tracks so that it wouldn't be caught.
That was the kind of thing it did sometimes. But the later version of the model, the one that we have now, after additional alignment training, seemed to stop doing that sort of thing almost completely, or at least it's so rare that we haven't noticed it yet.
But the bottom line is that on all of these standard measures of good behavior that Anthropic is actively working on, they find that Mythos is a very good boy indeed.
On none of the measures of alignment was it worse than the previous versions of Claude, and in most cases it was significantly more aligned and significantly more reliable.
That's definitely better than the alternative result, but I think it's really unclear how much we can trust that finding.