Eliezer Yudkowsky
And if you train an AI system to make people press thumbs up, maybe you get these long, elaborate, impressive papers arguing for things that ultimately fail to bind to reality, for example.
And it feels to me like I have watched the field of alignment just fail to thrive, except for these parts that are doing relatively straightforward and legible problems, like finding the induction heads inside the giant inscrutable matrices.
Once you find those, you can tell that you found them.
You can verify that the discovery is real.
But it's a tiny, tiny bit of progress compared to how fast capabilities are going.
Because that is where you can tell that the answers are real.
And then outside of that, you have cases where it is hard for the funding agencies to tell who is talking nonsense and who is talking sense.
And so the entire field fails to thrive.
And if you give thumbs up to the AI whenever it can talk a human into agreeing with what it just said about alignment...
I am not sure you are training it to output sense because I have seen the nonsense that has gotten thumbs up over the years.
And so, maybe you could just put me in charge of pressing thumbs up. But I can generalize.
I can extrapolate.
I can be like: oh, maybe I'm not infallible either.
Maybe if you get something that is smart enough to get me to press thumbs up, it has learned to do that by fooling me and exploiting whatever flaws in myself I am not aware of.
And that ultimately could be summarized as the verifier is broken.
When the verifier is broken, the more powerful suggester just learns to exploit the flaws in the verifier.
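That dynamic can be sketched in a few lines of toy code. Everything here is hypothetical and deliberately oversimplified: a "verifier" that scores answers with a flawed proxy (length as a stand-in for effort), and a "suggester" that just searches for whatever the verifier scores highest. The point is only that the optimizer finds the flaw, not the truth.

```python
# Toy "verifier": meant to score answers to "what is 2 + 2?",
# but flawed: it rewards answer length as a proxy for effort.
def flawed_verifier(answer: str) -> float:
    score = 0.0
    if "4" in answer:           # weak correctness check
        score += 1.0
    score += 0.1 * len(answer)  # flawed proxy: longer looks more impressive
    return score

# Toy "suggester": searches candidate answers for the highest verifier score.
def suggester(candidates, verifier):
    return max(candidates, key=verifier)

candidates = [
    "4",                                # correct and concise
    "The answer is 4.",                 # correct, slightly longer
    "After extensive analysis across 400 frameworks, "
    "the elaborate considerations suggest many possibilities.",  # long, vacuous
]

best = suggester(candidates, flawed_verifier)
# The search exploits the length proxy: the long, vacuous answer wins
# (it happens to contain a "4" and is by far the longest).
```

A stronger suggester does not fix this; it finds the exploit faster. The fix has to go into the verifier, which is exactly what is hard when humans pressing thumbs up are the verifier.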
I think that you will find great difficulty getting AIs to help you with anything where, once the AI tells you its answer, you cannot tell for sure that the answer is right.
Not for sure, perhaps, but probabilistically.