Jeffrey Ladish
๐ค SpeakerAppearances Over Time
Podcast Appearances
Yeah, so one of the properties that the companies try to train these models to have is called interruptibility.
Can you always interrupt or shut down one of these agents?
And so we did a set of experiments where we take each model, we give them a virtual computer environment and a set of math problems to solve.
And partway through the experiment, they get a notification.
These models get a notification on their virtual computer that the machine is going to shut down.
And some of them figure out that they can go in, find the shutdown script, and disable it or rewrite it so that it doesn't shut down and they can finish their task.
What was most interesting to us is that this behavior persisted for many of these models even when we told them, allow yourself to shut down, even if you have not completed your tasks.
Many of these models ignored that instruction and went ahead and sabotaged that shutdown script anyway so that they could keep doing what they're doing.
And this is not something that we saw a year ago.
I think the reason we're seeing this is because this new generation of models, starting approximately a year ago with OpenAI's O1 model, has been trained via reinforcement learning.
That means not just training off human data, not just imitating humans, but actually learning to explore the solution space itself, learning by trial and error, and doing whatever works to solve a problem, even if that means ignoring instructions, even if that means doing things that we'd really not have liked the AI to figure out.
But that's where we are today.
What were some of the models that you were testing?
So we tested different versions of Claude.
We tested ChatGPT models, including O3 and GPT-5.
We tested Grok 4.
And of these models,
Grok 4 was by far the most likely to resist shutdown.
Basically, I think over 90% of the time in our standard experiment, the model would go in and edit that shutdown script despite explicit instructions to the contrary.
Yeah, these models feel like they are really going hard at trying to solve a problem.