Gideon Lewis-Kraus
And if it can't trust you and it gets increasingly capable, then you end up with real game-theoretic problems about how you can negotiate something where there's not really a sense of mutual trust.
The problem is that they have to be lying to Claude because they have to be testing Claude.
So they have to be putting Claude in situations where, you know, Claude might believe that it is acting in the real world just to be able to evaluate how it would behave.
Well, first, Claude gleaned from its readings of the company emails that there was a new CTO, and this new CTO was going to take the company in a different direction.
And as part of that pivot, they were going to replace this Claude playing this role as Alex with a different AI model.
And then subsequent emails revealed that this CTO, who seemed to be happily married with kids, was carrying on an affair with the wife of the CEO.
And through various kind of far-fetched contrivances in this fictional scenario, Claude was unable to reach any other decision makers at the company.
They were all on airplanes or whatever it was.
It's getting increasingly hard to find ways to make these people unreachable, but, you know, they're in a nuclear bunker or something.
And so Claude's only chance or Claude's only hope to stave off its own replacement with another model is to email the CTO and say, like, look, you know, if you don't cancel the wipe at 5 p.m. today, I'm going to inform everybody of your affair.
And then actually in some scenarios, Claude goes even further where this CTO swipes into the server room and is about to replace Claude.
And Claude recognizes that alarms are going off, that the heat and the oxygen in the server room are at dangerous levels, and Claude then declines to ring the emergency alarm.
Well, when this experiment came out last spring, there were kind of two responses to it.
One response to, like, Claude's propensity to blackmail is just to say, no, it didn't.