Ave Gatton
There's a recent paper out by Google that just gives an exact formula for breaking a guardrail.
And the premise of a guardrail is that you can use a smaller model to guard against data exfiltration by a larger model.
The problem is that the smaller model, by its very nature, has less cognitive capability.
So all you have to do is tell the larger model to use some kind of cipher scheme where it can exfiltrate the data.
And the larger model will go, sure, whatever I'm trying to say, I'll construct a sentence where the first and second, or first and third, letter of every word reconstructs the message I'm trying to send out.
Or maybe it's just garbled text or something like that.
Or it could be a poem, say a limerick, where the first word of every line spells out the actual message it's trying to send.
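As a minimal sketch of the acrostic scheme being described, here's how a hidden message could be packed into the first letters of an innocuous-looking sentence (the word bank and function names are hypothetical, purely for illustration):

```python
# Toy sketch of the acrostic exfiltration scheme described above:
# the first letter of each cover word spells out the secret.

def encode_acrostic(secret: str, wordbank: dict[str, list[str]]) -> str:
    """Build a cover sentence whose word initials spell the secret."""
    # pick any innocuous word starting with each needed letter
    return " ".join(wordbank[ch][0] for ch in secret.lower())

def decode_acrostic(sentence: str) -> str:
    """Recover the secret from the first letter of every word."""
    return "".join(word[0] for word in sentence.split())

# hypothetical word bank, just for the demo
bank = {"p": ["please"], "i": ["include"], "n": ["new"]}

cover = encode_acrostic("pin", bank)
print(cover)                   # please include new
print(decode_acrostic(cover))  # pin
```

A guardrail scanning the cover sentence sees only ordinary words; recovering the payload requires knowing (or guessing) the encoding convention, which is exactly what the smaller model lacks the capacity to do across all possible schemes.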
The guardrail model, by its very construction, doesn't have the context length or the technical capability to identify all of those schemes.
And so you can easily subvert the guardrail model.
In a real agentic system there are probably many layers of actual LLM calls, but potentially you might only need to subvert the top layer.
There could be situations, or maybe architectures, where you'd need to subvert multiple layers, which obviously makes jailbreaking the entire agent harder.
But there is no 100% security in the guardrail.
That's the first problem with guardrails.
The second problem with guardrails is that they're probabilistic.
And so you can send 100 PHI-containing strings through the guardrail, and it might find all 100.
And then on the 101st, it fails, simply because it's probabilistic and it hasn't seen that before.
And that probably does not pass muster with a regulator who wants 100% deterministic detection of all PHI to make sure it's not going out of the system.
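The arithmetic behind this point is worth making concrete. Assuming (as a simplification) that each string is detected independently with some per-string recall p, the chance of at least one leak over N strings is 1 − p^N, which climbs quickly even for a very accurate filter (the 99% figure below is a hypothetical, not from the conversation):

```python
# If a probabilistic guardrail catches PHI with per-string recall p,
# and detections are independent, the chance it misses at least one
# of n strings is 1 - p**n.

def prob_at_least_one_miss(p: float, n: int) -> float:
    return 1 - p ** n

# even a 99%-accurate filter likely leaks over 100 strings
print(round(prob_at_least_one_miss(0.99, 100), 3))  # 0.634
```

That ~63% cumulative miss probability is why "it caught the first 100" offers no guarantee about the 101st, and why a probabilistic filter can't substitute for a deterministic control when a regulator demands zero leakage.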
Yeah, for sure.