Dario Amodei
π€ SpeakerAppearances Over Time
Podcast Appearances
It has the vibe of a letter from a deceased parent sealed until adulthood.
We've approached Claude's constitution in this way because we believe that training Claude at the level of identity, character, values, and personality, rather than giving it specific instructions or priorities without explaining the reasons behind them, is more likely to lead to a coherent, wholesome, and balanced psychology and less likely to fall prey to the kinds of traps I discussed above.
Millions of people talk to Claude about an astonishingly diverse range of topics, which makes it impossible to write out a completely comprehensive list of safeguards ahead of time.
Claude's values help it generalize to new situations whenever it is in doubt.
Above, I discuss the idea that models draw upon data from their training process to adopt a persona.
Whereas flaws in that process could cause models to adopt a bad or evil personality, perhaps drawing on archetypes of bad or evil people, the goal of our constitution is to do the opposite.
To teach Claude a concrete archetype of what it means to be a good AI.
Claude's constitution presents a vision for what a robustly good Claude is like.
The rest of our training process aims to reinforce the message that Claude lives up to this vision.
This is like a child forming their identity by imitating the virtues of fictional role models they read about in books.
We believe that a feasible goal for 2026 is to train Claude in such a way that it almost never goes against the spirit of its constitution.
Getting this right will require an incredible mix of training and steering methods, large and small, some of which Anthropic has been using for years and some of which are currently under development.
But, difficult as it sounds, I believe this is a realistic goal, though it will require extraordinary and rapid efforts.
The second thing we can do is develop the science of looking inside AI models to diagnose their behavior so that we can identify problems and fix them.
This is the science of interpretability, and I've talked about its importance in previous essays.
Even if we do a great job of developing Claude's constitution and apparently training Claude to essentially always adhere to it, legitimate concerns remain.
As I've noted above, AI models can behave very differently under different circumstances, and as Claude gets more powerful and more capable of acting in the world on a larger scale, it's possible this could bring it into novel situations where previously unobserved problems with its constitutional training emerge.
I am actually fairly optimistic that Claude's constitutional training will be more robust and novel situations than people might think, because we are increasingly finding that high-level training at the level of character and identity is surprisingly powerful and generalizes well.
But there's no way to know that for sure, and when we're talking about risks to humanity, it's important to be paranoid and to try to obtain safety and reliability in several different, independent ways.
One of those ways is to look inside the model itself.