Dario Amodei
What's the likelihood that our AI models would behave in such a way, and under what conditions would they do so?
As with many issues, it's helpful to think through the spectrum of possible answers to this question by considering two opposite positions.
The first position is that this simply can't happen because the AI models will be trained to do what humans ask them to do, and it's therefore absurd to imagine that they would do something dangerous unprompted.
According to this line of thinking, we don't worry about a Roomba or a model airplane going rogue and murdering people because there is nowhere for such impulses to come from.
So why should we worry about it for AI?
The problem with this position is that there is now ample evidence, collected over the last few years, that AI systems are unpredictable and difficult to control.
We've seen behaviors as varied as obsessions, sycophancy, laziness, deception, blackmail, scheming, cheating by hacking software environments, and much more.
AI companies certainly want to train AI systems to follow human instructions, perhaps with the exception of dangerous or illegal tasks, but the process of doing so is more an art than a science, more akin to growing something than building it.
We now know that it's a process where many things can go wrong.
The second, opposite position, held by many who adopt the doomerism I described above, is the pessimistic claim that there are certain dynamics in the training process of powerful AI systems that will inevitably lead them to seek power or deceive humans.
Thus, once AI systems become intelligent enough and agentic enough, their tendency to maximize power will lead them to seize control of the whole world and its resources, and likely, as a side effect of that, to disempower or destroy humanity.
The usual argument for this, which goes back at least 20 years and probably much earlier, is that if an AI model is trained in a wide variety of environments to agentically achieve a wide variety of goals (for example, writing an app, proving a theorem, or designing a drug), there are certain common strategies that help with all of these goals, and one key strategy is gaining as much power as possible in any environment.
So, after being trained on a large number of diverse environments that involve reasoning about how to accomplish very expansive tasks and where power-seeking is an effective method for accomplishing those tasks, the AI model will generalize the lesson and develop either an inherent tendency to seek power or a tendency to reason about each task it is given in a way that predictably causes it to seek power as a means to accomplish that task.
The model will then apply that tendency to the real world, which to the model is just another task, and will seek power there, at the expense of humans.
This misaligned power-seeking is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives, one that masks many hidden assumptions, for definitive proof.
I think people who don't build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments, which has over and over again proved mysterious and unpredictable.
Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner.
In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows.