Menu
Sign In Search Podcasts Libraries Charts People & Topics Add Podcast API Blog Pricing

Rob Wiblin

πŸ‘€ Speaker
3881 total appearances
Voice ID

Voice Profile Active

This person's voice can be automatically recognized across podcast episodes using AI voice matching.

Voice samples: 1
Confidence: Medium

Appearances Over Time

Podcast Appearances

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And on top of those issues, as you probably know, it's often possible to trick AI models into violating their guidelines using so-called jailbreaks.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Years ago, those could be things like asking the model to imagine that it's in a role play situation.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

More recently, people have succeeded by turning their request into what has come to be called adversarial poetry, and there's other approaches that are stranger still.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Could those same techniques be used by the AI to trick the AI monitor that's watching what it's doing?

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Yes, David Ryan found very basic jailbreaks are extremely effective and robust, prompting the AI monitor very explicitly to watch out for that kind of attack doesn't help.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And I view this as a pretty severe vulnerability as it's a reliable, easy to implement single point of failure that could let a model completely circumvent the offline monitoring system.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

This is the kind of loophole that you could, in theory, drive a truck through.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But by good fortune, David Ryan found that AR models were absolutely atrocious at actually using jail bricks, even when he tried to hold their hand while doing it.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But that's a pretty thin defense to be relying on.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

So this is a hole that does need closing.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And this one isn't in the report, but I would personally add a third reason to be wary of relying entirely or too much on monitoring.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

All AI companies monitor their models using their own models because it's a hassle to get permission to use another company's AI model for your own internal training purposes effectively.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But this creates an apparent contradiction.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

If you don't trust a given AI model not to sabotage you, why would you trust it to snitch on itself?

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

You set a spy to catch a spy, yes, perhaps, but typically not exactly the same spy.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

The approach isn't actually as stupid as it sounds, though.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

It's called untrusted monitoring.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And the basic reason it might continue to work is that you can run honeypots past the model, feeding it instances you've generated of the model doing things that ought to be flagged.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And if it doesn't flag them, you know you've got a problem.

80,000 Hours Podcast
Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But if the model does flag them, then that's evidence that it will also flag real cases that you didn't make up.