Eliezer Yudkowsky
We knew about regular expressions in 2006, and these are pretty simple as regular expressions go.
So this is a case where, by dint of great sweat, we understood what is going on inside a transformer, but it's not the thing that makes transformers smart.
It's a kind of thing that we could have built by hand decades earlier.
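As a toy illustration (my own sketch, not the actual circuit from the interpretability work being referenced): the kind of copying pattern that has been found inside transformers, sometimes described as induction behavior, can indeed be written by hand as a short regular expression. The function name and example text here are hypothetical.

```python
import re

def induction_guess(text):
    # Toy "induction" pattern: to predict what follows the final word,
    # find an earlier occurrence of that word and return the word that
    # followed it then -- a copying rule expressible as a plain regex.
    last = text.split()[-1]
    m = re.search(rf"\b{re.escape(last)}\s+(\w+)", text)
    return m.group(1) if m else None

print(induction_guess("the cat sat on the"))  # -> cat
```

The point is just that the matching rule fits in one line of 1960s-era regex machinery; nothing about it requires a learned network.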
Yeah, I think there's multiple thresholds.
An example is the point at which a system has sufficient intelligence, situational awareness, and understanding of human psychology that it would have the capability, if it had the desire to do so, to fake being aligned.
Like it knows what responses humans are looking for and can compute the responses humans are looking for and give those responses without it necessarily being the case that it is sincere about that.
You know, it's a very understandable way for an intelligent being to act.
Humans do it all the time.
Imagine if your plan for achieving a good government is that you're going to ask anyone who requests to be dictator of the country whether they're a good person, and if they say no, you don't let them be dictator.
Now, the reason this doesn't work is that people can be smart enough to realize that the answer you're looking for is "yes, I'm a good person" and say that even if they're not really good people.
So the work of alignment might be qualitatively different above that threshold of intelligence or beneath it.
It doesn't have to be a very sharp threshold, but there's the point where you're building a system that does not, in some sense, know you're out there, and is not, in some sense, smart enough to fake anything.
And there's a point where the system is definitely that smart.
And there are weird in-between cases like GPT-4, where we have no insight into what's going on in there. And so we don't know to what extent there's a thing that, in some sense, has learned what responses the reinforcement learning from human feedback is trying to entrain, and is calculating how to give those responses, versus