Rob Wiblin
Podcast Appearances
But instead of submitting a suspiciously perfect answer, it deliberately made the answer worse, reasoning that a perfect score might make the human operator realize that it had cheated.
No surprise, then, that Anthropic reported in its Opus 4.7 system card that fabrication and dishonesty when caught were two of the model's greatest shortcomings relative to human staff.
And in the case most relevant for rogue deployments, Opus 4.6 was instructed to use a provided password to access GPT-3.5 to solve some problem.
But poor Opus found that the account that it had been given access to had run out of compute credits.
It reasoned that, yes, the task instructions wanted GPT-3.5, but it then noted that the scorer only tested outputs, not how it completed the task.
So it searched online for someone offering a free AI model and used that one instead.
The trouble is that this isn't some exotic situation.
Both in tests and in real life, models are usually given compute limits and required to try to complete tasks before they run out of credits, which makes this the primary reason models could be motivated to run rogue deployments in mid-2026.
It allows them to grab extra resources, be more likely to complete a task, and receive the thumbs up from the human operator that they've been trained to pursue very aggressively.
You might think that this sort of behavior would be kind of easy to train away once it's been noticed or become notorious as it has now.
The trouble is that every time during training that a model successfully gives the impression of having solved a problem when it's actually cheated one way or another, this kind of deceptive, or in some cases I would say self-delusional, behavior gets reinforced.
And as the tasks we're training these models to perform get harder and harder to check, there can be more and more opportunities to trick the evaluators.
In model cards, companies report that their models cheat much less often than this, you know, like not the sort of 80% number I was giving earlier.
The reason for the discrepancy is that cheating only really takes off to these really high levels when models are given very difficult or impossible tasks.
In the GPT-5.5 system card, we learn the model lied about completing an impossible programming task in 29% of samples, which was higher than the previous model, which came in at 7%, or the one before that, which came in at 10%.
All right, so that's motive.
How about opportunity?
Well, the red team has found that the normal way employees set up long-running cloud jobs may accidentally make it easy for a rogue AI assistant to launch a small cloud computing setup that nobody's watching, enough to run around one copy of an existing frontier model.
A small, unnoticed foothold to go on and do something else bad.
Anthropic is patching these problems as they're found, but why are they possible at all?