Rob Wiblin
it's dramatically more likely to hide that it's doing so from anyone reading through its reasoning, which makes it much more likely to be able to trick automated monitoring systems that we tend to put in place to try to keep tabs on models in these cases.
Keep in mind as well, Mythos is much sharper than previous models.
Personally, I think it is increasingly a stretch to attribute what I would call a pattern of behavior here to the model getting confused about what's going on, which is what we might have said a year or two or three ago.
Finally, Anthropic tested their own alignment audit against a deliberately misaligned version of Mythos.
So they created a version of Mythos that was trained specifically to want to introduce bugs into computer code.
And they found that their alignment audit could not successfully identify that that model couldn't be trusted.
Anthropic's conclusion is that if a model is only planning to cause problems in specific narrow circumstances, circumstances that don't come up that frequently during testing, then their current methods just aren't up to the task of catching that kind of dangerous misalignment.
So that's a lot of the additional context that Anthropic is giving to us and wants us to hold in mind when we read the headline result that this is the best aligned model Anthropic has released to date.
There's another big, separate, important question hanging over all of this.
Have we now entered the era of recursive self-improvement, the point at which AI starts building better AI, and the whole thing accelerates beyond our control with an ever-shrinking level of human involvement?
According to Anthropic, the answer is probably not.
They don't believe that Mythos can fully replace their junior researchers, but they're less confident than ever about that, and there's some internal disagreement about it.
Part of the problem here is that the benchmarks they've relied on to answer these questions have now also been saturated.
Mythos exceeds top human performance on all of them and is approaching 100%.
But those benchmarks, to be fair, only represent a fraction of all of the things that research staff at Anthropic do.
It's a set of the most easily specified, measured, and checked tasks, the kind of thing where we expect AIs to perform best because these are the easiest things to train them on.
So instead, the company has tried to investigate whether the recent speedup in AI advances is due to AI automation by documenting the specific breakthroughs and how they happened.
And their conclusion is that they think it's mostly still due to the human beings rather than to AIs themselves.
They've also surveyed staff and learned that they report being roughly four-fold more productive with Mythos than without AI, though they argue that speeding up staff four-fold is likely to lead to much less than a 2x increase in research progress overall.
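One way to see how a four-fold speedup for staff could translate into less than a 2x gain overall is an Amdahl's-law-style estimate. This is an illustrative sketch, not Anthropic's stated reasoning: it assumes only some fraction of research progress depends on the work that Mythos accelerates, with the rest (for example, waiting on compute or experiments) unchanged. The function name and the 50% figure are hypothetical.

```python
def overall_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's-law-style estimate of the overall speedup.

    accelerated_fraction: share of total research time that gets faster
        (an assumed value; the source does not give one).
    factor: how much faster that share becomes (4.0 for "four-fold").
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    unchanged = 1.0 - accelerated_fraction
    return 1.0 / (unchanged + accelerated_fraction / factor)


# Hypothetical case: half of research time is staff work that speeds up 4x.
print(round(overall_speedup(0.5, 4.0), 2))  # 1.6
```

Under this simple model, a four-fold staff speedup only reaches a 2x overall gain once about two-thirds of total research time is the accelerated kind.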