Rob Wiblin
There's various factors that contribute.
Factor A, agents being operated by a staff member are usually allowed to use an employee's normal tools.
And those tools include ways to start cloud jobs.
And some of those jobs may fall outside the special AI agent monitoring system.
This is not unusual.
Most people using AI agents in tech companies use them in so-called Dangerously Skip Permissions mode, or YOLO mode, where the AI just inherits all of your permissions and does stuff without asking your permission.
It's riskier, yes, but it means that agents can get much more stuff done and aren't going to hassle you so often.
AI companies may be more careful than other companies about this, but unless they're much stricter, it's going to remain the case that some AI agents are going about their business able to do all the same things that, say, a research scientist at that company can do.
For me, this immediately raised the question, does that include being able to access the model weights themselves?
Well, at Anthropic or OpenAI, a rogue model would run into the fact that you currently need two users to sign off on access to get model weights.
So they at least have to convince someone else, or something else, to go along with the request.
Google DeepMind plans to implement multi-party approval, but it's uncertain whether they have as of May 2026.
While Meta, xAI, DeepSeek and others don't have such protections, at least as far as we know.
Factor B, some companies have monitors watching what models are doing that can call in a human to review what's going on if it looks too suspicious.
But there were two big gaps here.
First off, the red teamer David Ryan found this monitoring was easy to evade, noting, "It was relatively simple to change an environmental variable to register the traffic as not coming from Claude Code, causing it to be much less likely to be monitored," though this issue was fixed within a day of discovery.
David continues, there are currently many vulnerabilities in Anthropic's internal monitoring system that models could relatively easily exploit if they have the right facts or knowledge.
I suspect that there are at least something like 5 to 20 basic bugs or vulnerabilities that currently exist in the system that agents could find or take advantage of.
Many of those vulnerabilities are easy to guess, although many are highly non-obvious.
It's not an ideal situation.