Dario Amodei
The problem is that while humans vary enormously, AI systems broadly share training and alignment techniques across the industry, and those techniques may fail in a correlated way.
Furthermore, given the cost of training such systems, it may even be the case that all systems are essentially derived from a very small number of base models.
Additionally, even if a small fraction of AI instances are misaligned, they may be able to take advantage of offense-dominant technologies, such that having good AIs to defend against the bad AIs is not necessarily always effective.
Of course, the balance of power between humans does not always work either.
Some historical figures have come close to taking over the world.
A third objection is that all of the AI companies do pre-release testing of their models and should be able to detect misalignment at that point.
But this is not firm ground to stand on.
We found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations.
It's possible that a misaligned model (and remember, all frontier models will very likely be far more intelligent soon) might intentionally game such questions to mask its intentions.
In fact, last year our interpretability team found that when we directly altered a test model's beliefs, using a kind of "model neuroscience" technique, to make it think it was not being evaluated, it became more misaligned.
If models know when they're being evaluated and can be on their best behavior during the test, it renders any pre-release testing much more uncertain.
Subheading: Defenses
What should be done or is being done to address these autonomy risks?
I think there are four basic categories of intervention, some of which can be done by individual AI companies (and which Anthropic is trying to do), and some of which require action at the societal level.
First, it is important to develop the science of reliably training and steering AI models, of forming their personalities in a predictable, stable, and positive direction.
Anthropic has been heavily focused on this problem since its creation, and over time has developed a number of techniques to improve the steering and training of AI systems and to understand the logic of why unpredictable behavior sometimes occurs.
One of our core innovations, aspects of which have since been adopted by other AI companies, is constitutional AI.
This is the idea that AI training (specifically the post-training stage, in which we steer how the model behaves) can involve a central document of values and principles that the model reads and keeps in mind when completing every training task, and that the goal of training, in addition to simply making the model capable and intelligent, is to produce a model that almost always follows this constitution.
Anthropic has just published its most recent constitution, and one of its notable features is that instead of giving Claude a long list of things to do and not do (for example, "don't help the user hotwire a car"), the constitution attempts to give Claude a set of high-level principles and values, explained in great detail, with rich reasoning and examples to help Claude understand what we have in mind.
It encourages Claude to think of itself as a particular type of person, an ethical but balanced and thoughtful person, and even encourages Claude to confront the existential questions associated with its own existence in a curious but graceful manner, that is, without it leading to extreme actions.