Dario Amodei
By looking inside, I mean analyzing the soup of numbers and operations that makes up Claude's neural net and trying to understand, mechanistically, what they are computing and why.
Recall that these AI models are grown rather than built, so we don't have a natural understanding of how they work, but we can try to develop an understanding by correlating the model's neurons and synapses to stimuli and behavior,
or even altering the neurons and synapses and seeing how that changes behavior, similar to how neuroscientists study animal brains by correlating measurement and intervention to external stimuli and behavior.
We've made a great deal of progress in this direction, and can now identify tens of millions of features inside Claude's neural net that correspond to human-understandable ideas and concepts, and we can also selectively activate features in a way that alters behavior.
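To make the two ideas above concrete (correlating internal units with stimuli, then intervening on a unit and watching the behavior change), here is a minimal sketch on a hand-built toy network. Everything in it is an assumption for illustration: the weights, the stimuli, and the names `hidden`, `forward`, `clamp`, and `pearson` are hypothetical, and it does not represent Claude or the actual methods used on it.

```python
# Toy illustration only: a hand-built two-layer network, not a real model.
import math

# Hypothetical weights: hidden unit 0 reads input[0], hidden unit 1 reads input[1].
W_IN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [2.0, -1.0]


def hidden(x):
    """ReLU hidden activations for input vector x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def forward(x, clamp=None):
    """Run the toy model. `clamp` maps a hidden index to a forced activation,
    which plays the role of the intervention step."""
    h = hidden(x)
    if clamp:
        for idx, value in clamp.items():
            h[idx] = value
    return sum(w * hi for w, hi in zip(W_OUT, h))


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Step 1 (correlation): present stimuli and record each hidden unit's activation.
stimuli = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
concept_strength = [x[0] for x in stimuli]  # the "concept" we are looking for
unit0 = [hidden(x)[0] for x in stimuli]
unit1 = [hidden(x)[1] for x in stimuli]
corr_unit0 = pearson(unit0, concept_strength)
corr_unit1 = pearson(unit1, concept_strength)

# Step 2 (intervention): force unit 0 to a high value and compare outputs.
baseline = forward([1.0, 1.0])
steered = forward([1.0, 1.0], clamp={0: 5.0})

print("correlation of unit 0 with concept:", corr_unit0)
print("correlation of unit 1 with concept:", corr_unit1)
print("output before and after clamping unit 0:", baseline, steered)
```

In this toy setup, unit 0 tracks the concept perfectly while unit 1 does not, and clamping unit 0 shifts the output. Real models are far messier, which is why the work described here is hard.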
More recently, we have gone beyond individual features to mapping circuits that orchestrate complex behavior like rhyming, reasoning about theory of mind, or the step-by-step reasoning needed to answer questions such as "What is the capital of the state containing Dallas?"
Even more recently, we've begun to use mechanistic interpretability techniques to improve our safeguards and to conduct audits of new models before we release them, looking for evidence of deception, scheming, power-seeking, or a propensity to behave differently when being evaluated.
The unique value of interpretability is that by looking inside the model and seeing how it works, you in principle have the ability to deduce what a model might do in a hypothetical situation you can't directly test. Being unable to test such situations is the worry with relying solely on constitutional training and empirical testing of behavior.
You also in principle have the ability to answer questions about why the model is behaving the way it is, for example, whether it is saying something it believes is false or hiding its true capabilities.
And thus it is possible to catch worrying signs even when there is nothing visibly wrong with the model's behavior.
To make a simple analogy, a clockwork watch may be ticking normally, such that it's very hard to tell that it is likely to break down next month, but opening up the watch and looking inside can reveal mechanical weaknesses that allow you to figure it out.
Constitutional AI, along with similar alignment methods, and mechanistic interpretability are most powerful when used together, as a back-and-forth process of improving Claude's training and then testing for problems.
The Constitution reflects deeply on our intended personality for Claude.
Interpretability techniques can give us a window into whether that intended personality has taken hold.
The third thing we can do to help address autonomy risks is to build the infrastructure necessary to monitor our models in live internal and external use and publicly share any problems we find.
The more that people are aware of a particular way today's AI systems have been observed to behave badly, the more that users, analysts, and researchers can watch for this behavior or similar behaviors in present or future systems.
It also allows AI companies to learn from each other.
When concerns are publicly disclosed by one company, other companies can watch for them as well.
And if everyone discloses problems, then the industry as a whole gets a much better picture of where things are going well and where they're going poorly.
Anthropic has tried to do this as much as possible.