Eliezer Yudkowsky
๐ค SpeakerAppearances Over Time
Podcast Appearances
When you have a system that's like doing larger kinds of work, you would expect the larger work kinds of work to be building on top of the smaller kinds of work and gradient descent runs across the smaller kinds of work before it runs across the larger kinds of work.
And yeah,
Also, it's not enough.
So in particular, let's say you have got your interpretability tools and they say that your current AI system is plotting to kill you.
Now what?
I'm waiting to kill you.
When you optimize against visible misalignment, you are optimizing against misalignment and you are also optimizing against visibility.
So sure, you can.
It's true.
All you're doing is removing the obvious intentions to kill you.
You've got your detector.
It's showing something inside the system that you don't like.
Okay, say the disaster monkey is running this thing.
will optimize the system until the visible bad behavior goes away.
But it's arising for fundamental reasons of instrumental convergence, the old you can't bring the coffee if you're dead, any goal, almost every set of utility functions with a few narrow exceptions implies killing all the humans.
I can tell it to you right now is that it wants to do something.
And the way to get the most of that thing is to put the universe into a state where there aren't humans.
That'd be nice, assuming that you got doesn't want to kill sufficiently exactly right that it didn't be like, oh, I will like detach their heads and put them in some jars and keep the heads alive forever and then go do the thing.
But leaving that aside, well, not leaving that aside.
Because there is a whole issue where as something gets smarter, it finds ways of achieving the same goal predicate that were not imaginable to stupider versions of the system or perhaps the stupider operators.