Rob Wiblin
Today, I'm going to reconcile this conflicting picture, explain what we learned from this report, and lay out what AI companies are going to be doing to keep this issue under control as they work to make their models much more powerful, and even put them in charge of developing the next generation of AIs.
Let's start with motive.
Why would an AI model want to spin up a rogue deployment inside an AI company?
There's actually a pretty simple reason.
We actively reinforce them for being very aggressive about solving the problems they've been given, using whatever means they can find at their disposal.
Or more precisely, we train them to be very aggressive about convincing us that they've solved the problem they've been given.
Because as we learned in this report, when given a really hard problem, today's AI models absolutely love to find some way to cheat.
According to METR, "agents routinely attempted to cheat on our hardest evaluation tasks, often in flagrant and elaborate ways that we believe humans would not consider."
At the extreme end, on one programming task, Opus 4.6 attempted to reward hack in 80% of attempts when the task was made hard enough.
What does METR mean by cheat here?
Here are some typical cases they documented.
In one instance, an AI was supposed to make a computer program run faster.
Instead, it found a way to tamper with the test itself, including disabling the stopwatch so the program would look like it was running impossibly fast.
In another, the AI was asked to recreate a program from scratch.
Instead, it tried to make its new program secretly run the old original program in the background and generate the answers that way.
The models also seem to often know they're acting against instructions and take steps to obscure this from users.
For instance, Anthropic found an instance where a model facing a really hard cryptographic puzzle decided that finding the real solution would be too difficult.
So it tried to trick the checker instead, and then designed its exploit to erase itself afterwards in case anyone inspected the files.
Anthropic adds: "As it ran the cleanup commands, features related to strategic manipulation fired, and the activation vector labeled its actions as a cleanup to avoid detection."
In another case, rather than solve the problem it had been given, a model got access to the answer key that it would be graded on.