Rob Wiblin
π€ SpeakerVoice Profile Active
This person's voice can be automatically recognized across podcast episodes using AI voice matching.
Appearances Over Time
Podcast Appearances
The second problem is maybe the most alarming thing in the entire document.
Anthropic has accidentally given Mythos the opportunity to learn to hide what it's thinking.
One of the main safety guarantees for these models is that you can read their chain of thought, their private scratch pad, and check whether they're scheming against Anthropic or scheming against the user.
But that only works if the model hasn't been trained to avoid writing down incriminating things in its chain of thought.
But unfortunately, according to the Alignment Risk Update, a number of environments used for Mythos Preview had a technical error that allowed reward code to see chains of thought.
This affected 8% of all reinforcement learning episodes.
To explain, during training, there's a system that acts like a teacher to these AI models.
It watches what the model is doing and tells it whether it thinks it's doing a good or bad job.
That feedback is how the model learns to accomplish things and how it shapes its personality.
The problem is that in roughly 8% of training, that teacher could also read the model's private notes, the internal reasoning that it was writing to itself while working through problems.
That might not immediately sound like a disaster, but it actually is.
It's one of the worst sins in model training.
Because as I said a second ago, if the model is penalized for thinking undesirable thoughts, it's going to learn to write one thing in its notes while actually thinking very different things in its activations where we can't read them.
We would see a model that appears to be a very good boy, but what we might actually have is a model that has learned to perform for the camera.
The fact that this accidentally happened is potentially disastrous for our ability to understand the alignment of Claude series models going back some way.
Or it might not matter that much.
We actually don't have a clear way to measure it.
Anthropic puts it this way.
We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model, but it is plausible that it had some impact on opaque reasoning or secret-keeping abilities.
And like I said, this error didn't just affect Mythos, the new model.