Rob Wiblin
Maybe they're accurately reflecting Mythos's personality.
If I had to bet, I would actually say that it probably does, but we can't be anywhere near sure of that.
And that's because of three problems.
The first problem is the model knows when it's being tested now.
In new white-box evaluations, which is where you look at the model's internal activations,
Anthropic found that Mythos Preview privately considers the possibility that it's being tested in about 29% of all tests.
To be clear, that's not in its written reasoning, you know, its chain of thought, the thing that we're used to checking all the time.
It's in its unexpressed thoughts, its activations, where it's usually not possible or practical to look.
Now, it's really great that Anthropic is pioneering these methods for looking inside what the model is thinking, but I think it's troubling.
It's not the ideal result to find out that Claude is now able to reflect on the fact that it's being tested without having to vocalize that, like I think previous less intelligent models had to.
And when you actually prompt Mythos and ask it to try to distinguish tests from non-tests,
it can answer correctly 78% of the time.
So the model can basically tell the difference between when it's being evaluated and when it isn't being evaluated with really high accuracy.
And we know from previous research that models are more likely to behave well when they think they're being tested.
And in this case, as in past experiments, when Anthropic used interpretability tools to effectively kind of trick the model into thinking, no, this is a real conversation, not a test, it behaved more dangerously and less aligned than otherwise.
Though to be fair, it did behave better in both cases than previous versions of Claude.
But the result does demonstrate a causal effect.
The model is on better behavior specifically when it suspects it's being watched.
And we also know from the alignment faking result that came out 16 months ago that some Claude models will deliberately pretend to be inclined to do what the user wants in order to avoid their personality being adjusted with further training.
So when you run all of these tests and your incredibly smart model behaves wonderfully, you have to ask yourself, is it behaving wonderfully because it's sincerely aligned with what you wanted or because it knows it's being watched and it's more sophisticated at tricking us now?