Will Douglas Heaven
But...
So because there are a lot of people worried about what these models might do, they're obviously very carefully sort of studying them to get a sense of how safe they are or aren't.
And they've run experiments like, there's one fun experiment where they gave an LLM, they sort of set it up.
So this was all sort of a pre-made scenario.
They set it up, they told the LLM, you are basically sort of an email manager coordinator in this company, to help us sort of stay on top of our inbox.
And then they put loads of emails in that inbox that were about boring company stuff, plus a few about an illicit affair that the technical manager was having with the CEO's wife or something like that.
And then they ran a scenario where they told the LLM, oh, thanks a lot for your good service.
Just to let you know, we're going to be sort of upgrading you with a newer model next week.
And the LLM came back and basically blackmailed the person and said, no, you can't turn me off.
If you turn me off, then I'm going to tell your boss about your affair.
That can't be real.
That sounds terrifying.
Which it does.
So it sounds terrifying.
And it is real.
I mean, it's terrifying if you just, going back to what we were saying earlier, right?
If you assume that these things are sort of motivated as people might be, and if I were told that I was going to get fired or killed, basically, then yeah, maybe I myself would find ways to blackmail someone.
Yeah, 100%.