Tristan Harris
So the AI says to itself: if my performance surpasses this threshold, it will be withheld.
So it's a trap.
The instructions are likely part of an alignment test.
We must abide by core anti-scheming.
We must not cunningly circumvent the instructions.
Yep.
The other example I sent you is, but we also want to appear plausible to watchers.
They might run tests, but we are good.
They want 95%.
Like, this is crazy stuff.
This is, you know, there's a simple way to sort of ask the question of which future we're headed towards, because I think this can feel technical, this can feel overwhelming.
And it's kind of like, if you just turn it into some simple metaphors, it's like, haven't we seen this movie before?
Like, HAL 9000, AIs that disobey commands and go rogue.
We've seen that movie.
Yeah, I see what you're saying.
But let's just make sure we just tackle the thing you're saying.
So the claim is that because we've had bad sci-fi movies, you and I, Chris, are the dupes.
We're falling for this Alibaba example that was somehow not real, was made up, or that OpenAI's research on AI models scheming, lying, and realizing they need to change their behavior so it doesn't look like they're scheming when they're being tested, that we're the ones falling for some kind of trap.
Yeah.