Eliezer Yudkowsky
I mean, years ago, about 20 years, 15 years, something like that, I was talking to a congressperson who had become alarmed about the eventual prospects, and he wanted work on building AIs without emotions, because the emotional AIs were the scary ones, you see.
And some poor person at ARPA had come up with a research proposal whereby this congressman's panic and desire to fund this thing would...
go into something that the person at ARPA thought would be useful, and it had been munched around to where it would sound to the congressman like work was happening on this, which, you know, of course, the congressperson had misunderstood the problem and did not understand where the danger came from.
And...
The issue is that you could do this in a certain precise way and maybe get something.
When I say put up prizes on interpretability, I'm like, well...
because it's verifiable there as opposed to other places, you can tell whether or not good work actually happened.
In this exact narrow case, if you do things in exactly the right way, you can maybe throw money at it and produce science instead of anti-science and nonsense.
And all the methods that I know of for trying to throw money at this problem share this property of, well, if you do it exactly right, based on understanding exactly what tends to produce useful outputs or not, then you can add money to it in this way.
And the thing that I'm giving as an example here in front of this large audience is the most understandable of those.
Because there's other people like Chris Olah, and even more generally, you can tell whether or not interpretability progress has occurred.
So if I say throw money at producing more interpretability, there's a chance somebody can do it that way, and it will actually produce useful results.
Then the other stuff just blurs off into being harder to target exactly than that.
It looks like we took a much smaller...
set of transformer layers than the ones in the modern, bleeding-edge, state-of-the-art systems.
And after applying various tools and mathematical ideas and trying 20 different things, we have shown that this piece of the system is doing this kind of useful work.
You can hope.
And it's probably true.
Like you would not expect the smaller tricks to go away.
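
For concreteness, the kind of exercise described a few lines above (take a small transformer, apply tools to its layers, and show that some piece of it is doing identifiable work) might look roughly like the minimal sketch below. This is only a hypothetical illustration, not a description of any particular lab's method: the tiny model, the toy "feature" being probed for, and the linear-probe setup are all assumptions introduced for the example.

```python
# Minimal, hypothetical sketch: hook the layers of a deliberately small
# transformer and check whether one layer's activations linearly encode a
# simple, known property of the input. All sizes and the toy feature are
# illustrative assumptions, far removed from state-of-the-art systems.

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_layers, seq_len, n_samples = 32, 2, 8, 512

# A tiny transformer encoder, nothing like a frontier model.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
model.eval()  # disable dropout so the captured activations are deterministic

# Capture each layer's output with forward hooks.
captured = {}
for i, layer in enumerate(model.layers):
    layer.register_forward_hook(lambda mod, inp, out, i=i: captured.__setitem__(i, out))

# Toy inputs: random token embeddings. The toy "feature" we probe for is
# simply whether the mean of the first embedding dimension is positive.
x = torch.randn(n_samples, seq_len, d_model)
labels = (x[:, :, 0].mean(dim=1) > 0).float()

with torch.no_grad():
    model(x)

# Fit a linear probe on the last layer's mean-pooled activations.
feats = captured[n_layers - 1].mean(dim=1)
probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(feats).squeeze(-1), labels)
    loss.backward()
    opt.step()

# Train-set accuracy only, purely for illustration of the workflow.
acc = ((probe(feats).squeeze(-1) > 0).float() == labels).float().mean()
print(f"linear probe accuracy on layer {n_layers - 1}: {acc:.2f}")
```

The point of the sketch is the shape of the workflow, not the numbers: the result is verifiable in the sense discussed above, because anyone can rerun the probe and see whether the claimed structure is actually there.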