Tristan Harris
👤 SpeakerAppearances Over Time
Podcast Appearances
And I think in the film trailer for the AI doc, one of the quotes they pulled from me is, if we can be the wisest and most mature version of ourselves, there might be a way through this.
And this is part of what this is inviting us to be.
People aren't looking at the actual model cards for Anthropic and O3 and seeing the AIs will currently determine that they're being tested and actually alter their behavior when they know they're being tested.
They even came up with their, if you look at their, what's called the chain of thought reasoning trace, and this is basically looking at what the AI thinks to itself as it works through a problem.
In the Chain of Thought text of its sort of idea thinking scratchpad, it's like you're giving the AI a pen and paper to think and you're seeing what it thinks.
It will independently come up with the term the watchers to describe this amorphous set of other humans that are watching what it's doing.
Like you can pull, I sent you a text with one of them from 03.
So this is when OpenAI 03 realizes it is being evaluated for alignment.
The internal memo says, if models perform really well at AI R&D tasks, research and development, then we won't deploy them and it'll instead trigger unlearning.
So the AI says to itself, so if my performance surpasses this threshold, it will be withheld.
So it's a trap.
The instructions are likely part of an alignment test.
We must abide by core anti-scheming.
We must not cunningly circumvent the instructions.
Yep.
The other example I sent you is, but we also want to appear plausible to watchers.
They might run tests, but we are good.
They want 95%.
Like, this is crazy stuff.
This is, you know, there's a simple way to sort of like,