Ryan Kidd
Honestly, Nathan, I'm pretty confused.
Like, I do think that, you know, contrary to expectations, it's looking like... language models understand our values, right?
That's the first thing to update.
Like, they understand them in some key sense.
It's not just regurgitation, like stochastic parrots.
Language models are...
really good at understanding human ethical mores and extrapolating on them in some scenarios.
They're also really good at sycophancy.
They're getting even better at deception, sophisticated deception.
They do tend to deceive users in the right circumstances, though there's some debate about this, and it seems like some of the deception is far from what we might call consequentialist, like hardcore consequentialist deception, in most situations.
But I think alignment faking and some other papers have shown that you can create situations where AIs will deceive the user to achieve some ulterior objective, which was something that was deliberately given to the AI as an objective.
So that's one of the caveats here.
So I don't think there are any examples of AIs coherently deceiving users, like pursuing a coherent objective, right?
Not just what we might call Goodhart deception, where they kind of fall into deceptive tendencies because of the limitations of training data.
And I'm talking about coherent deception.
There are few cases of this, if any, where it's a sustained, coherent deception scenario that appears to arise spontaneously through the training process, which is pretty good given the level of capabilities we have.
It seems like people five or 10 years ago certainly didn't think we'd have AIs that are capable of assisting with frontier science and are safe to deploy.