Eliezer Yudkowsky
To train a system to win chess games, you have to be able to tell whether a game has been won or lost.
And until you can tell whether it's been won or lost, you can't update the system.
So the problem I see is that your typical human has a great deal of trouble telling whether I or Paul Christiano is making more sense.
And that's with two humans, both of whom, I believe of Paul and claim of myself, are sincerely trying to help, and neither of whom is trying to deceive you.
I believe of Paul and claim of myself.
So the deception thing is the problem for you, the manipulation, the alien actress.
So yeah, there's like two levels of this problem.
One is that the weak systems are... well, there's three levels of this problem.
There's like the weak systems that just don't make any good suggestions.
There's like the middle systems where you can't tell if the suggestions are good or bad.
And there's the strong systems that have learned to lie to you.
I would love to dance around it.
No, I'm probably not doing a great job of explaining.
Which I can tell, because the Lex system didn't output like, ah, I understand.
So now I'm trying a different output to see if I can elicit the... Well, no, a different output.
I'm being trained to output things that make Lex think that he understood what I'm saying and agree with me.
Help me out here.
I'm trying not to be.
I'm also trying to be constrained to say things that I think are true and not just things that get you to agree with me.