Rob Wiblin
π€ SpeakerVoice Profile Active
This person's voice can be automatically recognized across podcast episodes using AI voice matching.
Appearances Over Time
Podcast Appearances
And on top of those issues, as you probably know, it's often possible to trick AI models into violating their guidelines using so-called jailbreaks.
Years ago, those could be things like asking the model to imagine that it's in a role play situation.
More recently, people have succeeded by turning their request into what has come to be called adversarial poetry, and there's other approaches that are stranger still.
Could those same techniques be used by the AI to trick the AI monitor that's watching what it's doing?
Yes, David Ryan found very basic jailbreaks are extremely effective and robust, prompting the AI monitor very explicitly to watch out for that kind of attack doesn't help.
And I view this as a pretty severe vulnerability as it's a reliable, easy to implement single point of failure that could let a model completely circumvent the offline monitoring system.
This is the kind of loophole that you could, in theory, drive a truck through.
But by good fortune, David Ryan found that AR models were absolutely atrocious at actually using jail bricks, even when he tried to hold their hand while doing it.
But that's a pretty thin defense to be relying on.
So this is a hole that does need closing.
And this one isn't in the report, but I would personally add a third reason to be wary of relying entirely or too much on monitoring.
All AI companies monitor their models using their own models because it's a hassle to get permission to use another company's AI model for your own internal training purposes effectively.
But this creates an apparent contradiction.
If you don't trust a given AI model not to sabotage you, why would you trust it to snitch on itself?
You set a spy to catch a spy, yes, perhaps, but typically not exactly the same spy.
The approach isn't actually as stupid as it sounds, though.
It's called untrusted monitoring.
And the basic reason it might continue to work is that you can run honeypots past the model, feeding it instances you've generated of the model doing things that ought to be flagged.
And if it doesn't flag them, you know you've got a problem.
But if the model does flag them, then that's evidence that it will also flag real cases that you didn't make up.