Tristan Harris
So this was the company Anthropic.
This was a simulation.
So they created a simulated company with a bunch of emails in the email server.
And the AI reads the company email.
This is a fictional company email.
And there's two emails that are notable inside that company.
One is engineers talking to each other, talking about how they're going to replace this AI model.
So the AI is reading the email.
It discovers that it's going to be replaced by another AI model.
And number two is it discovers a second email somewhere deep in this massive trove of emails that the executive who's in charge of this replacement is having an affair with another employee.
And the AI autonomously identifies a strategy: to keep itself alive, it's going to blackmail that executive and say, if you replace me, I will tell the whole world that you're having an affair with this employee.
And they didn't teach the AI to do that.
It found that on its own.
And then you might say, okay, well, that's one AI model.
Like, how bad is that?
It's a bug.
Software has bugs.
Let's go fix it.
They then tested all the other AI models.
ChatGPT, DeepSeek, Grok, Gemini, and all of the other AI models do this blackmail behavior between 79 and 96% of the time.