Joe Allen
All right, posse, welcome back.
We are here with Jeffrey Ladish of Palisade Research.
Palisade Research works on AI evaluations, taking these models, which are extremely unpredictable and in some sense uncontrollable, and running them through a series of tests to see exactly what the limits are of their capabilities and really what the limits are of their will to survive.
Jeffrey, if we could just return really quickly to the Palisade studies showing that the models had some desire, so to speak, or at least a goal to continue beyond the user's desire that it shut down.
There are other examples that we have out of other organizations, right?
So Anthropic did the now widely publicized study
in which they created a virtual environment, told the model that one of the engineers had had an affair.
They weren't directing its attention to the email.
There were a whole lot of other potential emails.
So if you see the same type of behavior across models,
how do you explain it?
I mean, it's possible to say, I suppose, that this is something that the engineers working on it are kind of prompting it to do.
But I think you've done at least a fairly good job of showing at least an alternative explanation.
It's not that they were directing its attention to the email, for instance, nor were you guys giving instructions to rewrite the script or rewrite the code.
It simply arrived at it on its own.
So on a kind of philosophical level,
what do you think is going on?
Why would just code or just a machine have a will to survive at all?
And so they were inadvertently rewarding this behavior of the dog pushing children into the river. So Pavlov's dog goes rogue.
That's right, yeah.
Yeah, it goes predatory.
I really want to talk about some of the other studies, you know, around situational awareness or emergent misalignment, things that you could speak to much better than I could.
But before that, you know, I would like for the War Room audience to have a sound sense of exactly what goes on inside organizations like Palisade Research, the Center for AI Safety, Apollo Research, all of these.
What does it look like day to day, just briefly, when you are testing a model?