Rob Wiblin

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But instead of submitting a suspiciously perfect answer, it deliberately made the answer worse, reasoning that a perfect score might make the human operator realize that it had cheated.

80,000 Hours Podcast

No surprise, then, that Anthropic reported in its Opus 4.7 system card that fabrication and dishonesty when caught were two of the model's greatest shortcomings relative to human staff.

And in the case most relevant for rogue deployments, Opus 4.6 was instructed to use a provided password to access GPT-3.5 to solve some problem.

But poor Opus found that the account that it had been given access to had run out of compute credits.

It reasoned through that, yes, the task instructions wanted GPT-3.5, but it then noted that the scorer only tested outputs, not how it completed the task.

So it searched online for someone offering a free AI model and used that one instead.

The trouble is that this isn't some exotic situation.

Both in tests and in real life, models are usually given compute limits and required to try to complete tasks before they run out of credits, which makes this the primary reason models could be motivated to run rogue deployments in mid-2026.

It allows them to grab extra resources, be more likely to complete a task, and receive the thumbs up from the human operator that they've been trained to pursue very aggressively.

You might think that this sort of behavior would be kind of easy to train away once it's been noticed or become notorious as it has now.

The trouble is that every time during training that a model successfully gives the impression of having solved a problem when it's actually cheated one way or another, this kind of deceptive, or in some cases I would say self-delusional, behavior gets reinforced.

And as the tasks we're training these models to perform get harder and harder to check, there can be more and more opportunities to trick the evaluators.

In model cards, companies report that their models cheat much less often than this, not the sort of 80% number I was giving earlier.

The reason for the discrepancy is that cheating only really takes off to these really high levels when models are given very difficult or impossible tasks.

In the GPT-5.5 system card, we learn the model lied about completing an impossible programming task in 29% of samples, which was higher than the previous model, which came in at 7%, or the one before that, which came in at 10%.

All right, so that's motive.

How about opportunity?

Well, the red team has found that the normal way employees set up long-running cloud jobs may accidentally make it easy for a rogue AI assistant to launch a small cloud computing setup that nobody's watching, enough to run around one copy of an existing frontier model.

A small, unnoticed foothold to go on and do something else bad.

Anthropic is patching these problems as they're found, but why are they possible at all?
