Eliezer Yudkowsky
And then they could just exploit any hole.
Yep.
So it could be that the critical moment is not when it's smart enough that everybody's about to fall over dead, but rather when it's smart enough that it can get onto a less controlled GPU cluster, faking the books on what's actually running on that GPU cluster, and start improving itself without humans watching it.
And then it gets smart enough to kill everyone from there, but it wasn't smart enough to kill everyone at the critical moment, the moment when you screwed up, when you needed to have done better by that point or everybody dies.
So the problem is that what you can learn on the weak systems may not generalize to the very strong systems because the strong systems are going to be different in important ways.
Chris Olah's team has been working on mechanistic interpretability, understanding what is going on inside the giant inscrutable matrices of floating-point numbers by taking a telescope to them and figuring out what is going on in there.
Have they made progress?
Yes.
Have they made enough progress?
Well, you can try to quantify this in different ways.
One of the ways I've tried to quantify it is by putting up a prediction market on whether, in 2026, we will have understood anything that goes on inside a giant transformer net that was not known to us in 2006.
Like, we have now understood induction heads in these systems by dint of much research and great sweat and triumph, which is like a thing where if you go like AB, AB, AB, it'll be like, oh, I bet that continues AB.
And a bit more complicated than that.
But the point is, like, we knew about regular expressions in 2006, and these are pretty simple as regular expressions go.
So this is a case where, by dint of great sweat, we understood what is going on inside a transformer, but it's not the thing that makes transformers smart.
It's a kind of thing that we could have built by hand decades earlier.
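To make that last point concrete, here is a minimal hand-written sketch in Python of the pattern-continuation behavior described above (the function name and example strings are illustrative, not taken from any interpretability library): predict the next token by finding the most recent earlier occurrence of the current token and copying whatever followed it.

# A hand-built analogue of the induction-head behavior described above:
# given a repeated pattern like A B A B A, guess that the next token is B
# by finding the previous occurrence of the current token and copying
# whatever came after it.
def induction_predict(tokens):
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards through earlier positions for a prior occurrence of the last token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # No earlier occurrence, so no prediction.

print(induction_predict(list("ABABA")))  # -> 'B'
print(induction_predict(list("XQXQX")))  # -> 'Q'

That is the sense in which the mechanism is simple: a few lines of lookup logic, the kind of thing anyone could have written by hand decades before transformers existed.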