Rob Wiblin

👤 Speaker

3881 total appearances

Voice ID

Voice Profile Active

This person's voice can be automatically recognized across podcast episodes using AI voice matching.

Voice samples: 1

Confidence: Medium

Appearances Over Time

Podcast Appearances

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And on top of those issues, as you probably know, it's often possible to trick AI models into violating their guidelines using so-called jailbreaks.

496 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Years ago, those could be things like asking the model to imagine that it's in a role play situation.

503.491 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

More recently, people have succeeded by turning their request into what has come to be called adversarial poetry, and there's other approaches that are stranger still.

508.238 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Could those same techniques be used by the AI to trick the AI monitor that's watching what it's doing?

516.41 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

Yes, David Ryan found very basic jailbreaks are extremely effective and robust, prompting the AI monitor very explicitly to watch out for that kind of attack doesn't help.

521.519 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And I view this as a pretty severe vulnerability as it's a reliable, easy to implement single point of failure that could let a model completely circumvent the offline monitoring system.

530.398 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

This is the kind of loophole that you could, in theory, drive a truck through.

539.677 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But by good fortune, David Ryan found that AR models were absolutely atrocious at actually using jail bricks, even when he tried to hold their hand while doing it.

543.104 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But that's a pretty thin defense to be relying on.

551.061 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

So this is a hole that does need closing.

553.867 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And this one isn't in the report, but I would personally add a third reason to be wary of relying entirely or too much on monitoring.

556.433 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

All AI companies monitor their models using their own models because it's a hassle to get permission to use another company's AI model for your own internal training purposes effectively.

563.381 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But this creates an apparent contradiction.

572.292 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

If you don't trust a given AI model not to sabotage you, why would you trust it to snitch on itself?

574.274 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

You set a spy to catch a spy, yes, perhaps, but typically not exactly the same spy.

579.08 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

The approach isn't actually as stupid as it sounds, though.

584.066 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

It's called untrusted monitoring.

586.088 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And the basic reason it might continue to work is that you can run honeypots past the model, feeding it instances you've generated of the model doing things that ought to be flagged.

587.93 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

And if it doesn't flag them, you know you've got a problem.

596.619 View full episode →

80,000 Hours Podcast

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

But if the model does flag them, then that's evidence that it will also flag real cases that you didn't make up.

598.702 View full episode →

← Previous Page 27 of 195 Next →

Report any issue

Rob Wiblin

Voice Profile Active

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment