Ryan Kidd
๐ค SpeakerAppearances Over Time
Podcast Appearances
Hacking out of the box, getting money, getting influence, all that kind of stuff.
And I think we want to be tracking all of that very closely.
And I don't think we're at red lines currently, but we are approaching them, I would say.
And separately, I think we should be tracking, as you say, this model organisms work where we try to elicit dangerous behavior from AIs.
You can think of this as like with your child, right?
You leave some cookies out and you're like, don't eat the cookies.
And then you like turn away, but you're like secretly looking.
And if they eat the cookies, I caught you.
So that's the kind of thing we're doing.
The thing is that AI systems can really detect when they're in training versus real environments.
But if you recall the AI 2027 scenario, a lot of this...
yet like discussion around that people are talking about online learning like it's like we train open router trains the last big AI agent and then from then on it's just like kind of constantly learning online through some sort of RL paradigm if AI systems are perpetually online then they're always in deployment and you've got to have monitors and control protocols
So that's why control research is so important, especially for the early days, to catch some of these slip-ups.
So you can do all the model organisms work you want in the lab, and that's like one layer of defense to see if we have these capabilities or these penchants for deception emerge.
And separately, you need to have all the control evals studying them as they're deployed.
especially if they're going to be learning online, perhaps updating their behaviors, and just be constantly checking for this stuff.
Be ready.
Have like a fallback plan, a rapid response plan.
What are you going to do if actually you see serious warning signs?
Can you shut the models down?