Eliezer Yudkowsky
They don't know about any of that.
They're looking at a design, and they don't see how the design outputs cold air.
It uses aspects of reality that they have not learned.
So magic in this sense is I can tell you exactly what I'm going to do, and even knowing exactly what I'm going to do, you can't see how I got the results that I got.
That's a really nice example.
Even now, GPT-4, is it lying to you?
Is it using an invalid argument?
Is it persuading you via the kind of process that could persuade you of false things as well as true things?
Because the basic paradigm of machine learning that we are presently operating under is that you can have the loss function, but only for things you can evaluate.
If what you're evaluating is human thumbs up versus human thumbs down, you learn how to make the human press thumbs up.
That doesn't mean that you're making the human press thumbs up using the kind of rule that the human wants to be the case for what they press thumbs up on.
You know, maybe you're just learning to fool the human.
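A minimal sketch of that point, assuming a PyTorch-style reward model with made-up names and shapes (not how any particular lab implements RLHF): the objective only ever sees the rater's thumbs-up or thumbs-down label, so an answer that merely looks good to the rater reduces the loss exactly as well as an answer that actually is good.

```python
# Minimal sketch: the loss only sees the human's click, never whether
# the answer was genuinely good or merely convincing to the rater.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # stand-in for a full transformer head

    def forward(self, response_features):
        return self.score(response_features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Hypothetical batch: encoded model responses plus the rater's click.
features = torch.randn(32, 16)                   # stand-in for encoded outputs
thumbs_up = torch.randint(0, 2, (32,)).float()   # 1 = thumbs up, 0 = thumbs down

logits = model(features)
loss = loss_fn(logits, thumbs_up)  # minimized by predicting the click, however it was earned
loss.backward()
optimizer.step()
```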
That's so fascinating and terrifying, the question of lying.
On the present paradigm, what you can verify is what you get more of.
If you can't verify it, you can't ask the AI for it, because you can't train it to do things that you cannot verify.
Now, this is not an absolute law, but it's like the basic dilemma here.
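One way to see the dilemma, as a toy sketch in plain Python with hypothetical names: a check we can actually run (here, verifying arithmetic) can filter samples into training data, while any property we care about but cannot check never enters the loop and so never gets optimized for.

```python
# Toy sketch: only mechanically checkable properties can feed back into training.
import random

def sample_answer(problem):
    """Stand-in for sampling from a model: sometimes right, sometimes wrong."""
    a, b = problem
    return a + b + random.choice([0, 0, 0, 1])  # occasionally off by one

def verifier(problem, answer):
    """Arithmetic can be checked exactly, so it can supply training signal."""
    a, b = problem
    return answer == a + b

problems = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(1000)]
training_set = []
for p in problems:
    ans = sample_answer(p)
    if verifier(p, ans):              # verified -> kept -> reinforced
        training_set.append((p, ans))
    # Anything we care about but cannot verify (e.g. "was the reasoning honest?")
    # never shows up in this loop, so nothing optimizes for it.

print(f"kept {len(training_set)} of {len(problems)} samples as training data")
```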
Maybe you can verify it for simple cases and then scale it up without retraining it somehow, say by chain of thought, by making the chains of thought longer, and get more powerful stuff that you can't verify, but which generalizes from the simpler stuff that you did verify.
And then the question is, did the alignment generalize along with the capabilities?
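An illustrative sketch of that question in plain Python, with all names and thresholds made up, and with the model's off-distribution behavior stipulated purely to show what measurement can and cannot see: wherever no verifier exists, whether the alignment generalized is exactly the thing you cannot check.

```python
# Illustrative sketch: we can only audit short chains of thought, then extrapolate.
def can_verify(chain_length):
    # Assumption for illustration: chains up to 3 steps are short enough to check.
    return chain_length <= 3

def behaves_as_intended(chain_length):
    # Unknown in reality; stipulated here only to show what measurement can see.
    return chain_length <= 5

for chain_length in range(1, 9):
    if can_verify(chain_length):
        verdict = "checked: OK" if behaves_as_intended(chain_length) else "checked: misbehaved"
    else:
        verdict = "unverifiable: did alignment generalize along with capability?"
    print(chain_length, verdict)
```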