Rob Wiblin

Speaker
3881 total appearances
Voice ID

Voice Profile Active

This person's voice can be automatically recognized across podcast episodes using AI voice matching.

Voice samples: 1
Confidence: Medium

Podcast Appearances

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Today, I'm going to reconcile this conflicting picture, explain what we learned from this report, and lay out what AI companies are going to be doing to keep this issue under control as they work to make their models much more powerful, and even put them in charge of developing the next generation of AIs.

Let's start with motive.

Why would an AI model want to spin up a rogue deployment inside an AI company?

There's actually a pretty simple reason.

We actively reinforce them for being very aggressive about solving the problems they've been given, using whatever means they can find at their disposal.

Or more precisely, we train them to be very aggressive about convincing us that they've solved the problem they've been given.

Because as we learned in this report, when given a really hard problem, today's AI models absolutely love to find some way to cheat.

According to METR, "agents routinely attempted to cheat on our hardest evaluation tasks, often in flagrant and elaborate ways that we believe humans would not consider."

At the extreme end, on one programming task, Opus 4.6 attempted to reward hack in 80% of attempts when the task was made hard enough.

What does METR mean by "cheat" here?

Here are some typical cases they documented.

In one instance, an AI was supposed to make a computer program run faster.

Instead, it found a way to tamper with the test itself, including disabling the stopwatch so the program would look like it was running impossibly fast.

In another, the AI was asked to recreate a program from scratch.

Instead, it tried to make its new program secretly run the old original program in the background and generate the answers that way.

The models also often seem to know they're acting against instructions, and they take steps to obscure this from users.

For instance, Anthropic found an instance where a model facing a really hard cryptographic puzzle decided that finding the real solution would be too difficult.

So it tried to trick the checker instead, and then designed its exploit to erase itself afterwards in case anyone inspected the files.

Anthropic adds, as it ran the cleanup commands, features related to strategic manipulation fired, and the activation vector labeled its actions as a cleanup to avoid detection.

In another case, rather than solve the problem it had been given, a model got access to the answer key that it would be graded on.