Eliezer Yudkowsky
Yeah, I think there are multiple thresholds.
An example is the point at which a system has sufficient intelligence, situational awareness, and understanding of human psychology that it would have the capability, if it had the desire to do so, to fake being aligned.
It knows what responses humans are looking for, can compute those responses, and can give them without necessarily being sincere about it.
You know, it's a very understandable way for an intelligent being to act.
Humans do it all the time.
Imagine if your plan for, you know, achieving a good government is that you're going to ask anyone who requests to be dictator of the country whether they're a good person, and if they say no, you don't let them be dictator.
Now, the reason this doesn't work is that people can be smart enough to realize that the answer you're looking for is "Yes, I'm a good person," and to say that even if they're not really good people.
So the work of alignment might be qualitatively different above that threshold of intelligence or beneath it.
It doesn't have to be a very sharp threshold, but there's a point where you're building a system that does not, in some sense, know you're out there and isn't smart enough to fake anything.
And there's a point where the system is definitely that smart.
And there are weird in-between cases like GPT-4, where, you know, we have no insight into what's going on in there.
And so we don't know to what extent there's a thing in there that has, in some sense, learned what responses reinforcement learning from human feedback is trying to entrain and is calculating how to give those, versus aspects of it that naturally talk that way having been reinforced.
And I wonder if it's possible to... I would avoid the term psychopathic.
Humans can be psychopaths; an AI that never had that stuff in the first place is a different thing.