Eliezer Yudkowsky
Once you find those, you can tell that you found them.
You can verify that the discovery is real.
But it's a tiny, tiny bit of progress compared to how fast capabilities are going.
Because that is where you can tell that the answers are real.
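A minimal sketch of that verification asymmetry, in Python, using an invented factoring example (not one from the conversation): finding the factors of a number can take an enormous search, but once someone hands you candidate factors, a single multiplication tells you whether the answer is real.

```python
# Toy sketch of verification asymmetry (invented example):
# finding an answer is expensive, checking a proposed answer is cheap.

def find_factors(n: int) -> tuple[int, int]:
    """Expensive search: trial division up to sqrt(n)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # n is prime

def verify_factors(n: int, p: int, q: int) -> bool:
    """Cheap check: one multiplication tells you whether the claim is real."""
    return p > 1 and q > 1 and p * q == n

# Whoever (or whatever) proposes the factors, you can tell if they are right:
assert verify_factors(15, 3, 5)
assert not verify_factors(15, 1, 15)  # trivial factorization doesn't count
```

Domains with that shape, where the check is cheap and trustworthy, are the ones where an outside judge can tell sense from nonsense.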
And then outside of that, you have cases where it is hard for the funding agencies to tell who is talking nonsense and who is talking sense.
And so the entire field fails to thrive.
And if you give thumbs up to the AI whenever it can talk a human into agreeing with what it just said about alignment...
I am not sure you are training it to output sense because I have seen the nonsense that has gotten thumbs up over the years.
And so, maybe you could just put me in charge. But I can generalize.
I can extrapolate.
I can be like: oh, maybe I'm not infallible either.
Maybe if you get something that is smart enough to get me to press thumbs up, it has learned to do that by fooling me and exploiting whatever flaws in myself I am not aware of.
And that ultimately could be summarized as the verifier is broken.
When the verifier is broken, the more powerful suggester just learns to exploit the flaws in the verifier.
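A toy model of that failure mode, in Python (all the functions and numbers here are invented for illustration, nothing from the conversation): a suggester splits a fixed effort budget between real substance and mere persuasion, and a flawed verifier overweights persuasion. The harder the suggester searches for a high verifier score, the more its answers exploit the flaw and the less substance they contain.

```python
import random

# Toy model of a broken verifier (invented for illustration).
# Each answer splits one unit of effort between substance and persuasion.

def make_answer(persuasion_effort: float) -> dict:
    return {"substance": 1.0 - persuasion_effort,
            "persuasion": persuasion_effort}

def true_quality(answer: dict) -> float:
    return answer["substance"]

def verifier_score(answer: dict) -> float:
    # The flaw: the verifier can't tell persuasion from substance,
    # and in fact rewards persuasion more.
    return answer["substance"] + 3.0 * answer["persuasion"]

def suggest(search_power: int) -> dict:
    """A more powerful suggester tries more candidates and keeps
    the one the verifier likes best."""
    candidates = [make_answer(random.random()) for _ in range(search_power)]
    return max(candidates, key=verifier_score)

random.seed(0)
for power in (1, 10, 1000):
    best = suggest(power)
    print(f"search power {power:>4}: "
          f"verifier score {verifier_score(best):.2f}, "
          f"true quality {true_quality(best):.2f}")
# As search power grows, the verifier's score climbs toward its maximum
# while true quality collapses toward zero: the extra optimization
# pressure flows into the verifier's flaw, not into being right.
```

Pressing thumbs up on whatever talks you into agreement is exactly this kind of verifier; the suggester that wins is the one best at finding your flaws.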
I think you will find great difficulty getting AIs to help you with anything where you cannot tell for sure that the AI is right once it tells you its answer.
For sure, yes, but probabilistically?
Yeah, the probabilistic stuff is a giant wasteland of, you know, Eliezer and Paul Christiano arguing with each other, and EAs going like... And that's with two actually trustworthy systems that are not trying to deceive you.
You're talking about the two humans?
Myself and Paul Christiano, yeah.