Dario Amodei

Speaker
1816 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Dario Amodei – The Adolescence of Technology" by habryka

By looking inside, I mean analyzing the soup of numbers and operations that makes up Claude's neural net and trying to understand, mechanistically, what they are computing and why.

Recall that these AI models are grown rather than built, so we don't have a natural understanding of how they work, but we can try to develop an understanding by correlating the model's neurons and synapses to stimuli and behavior, or even altering the neurons and synapses and seeing how that changes behavior, similar to how neuroscientists study animal brains by correlating measurement and intervention to external stimuli and behavior.

We've made a great deal of progress in this direction, and can now identify tens of millions of features inside Claude's neural net that correspond to human-understandable ideas and concepts, and we can also selectively activate features in a way that alters behavior.

More recently, we have gone beyond individual features to mapping circuits that orchestrate complex behavior like rhyming, reasoning about theory of mind, or the step-by-step reasoning needed to answer questions such as "What is the capital of the state containing Dallas?"

Even more recently, we've begun to use mechanistic interpretability techniques to improve our safeguards and to conduct audits of new models before we release them, looking for evidence of deception, scheming, power-seeking, or a propensity to behave differently when being evaluated.

The unique value of interpretability is that by looking inside the model and seeing how it works, you in principle have the ability to deduce what a model might do in a hypothetical situation you can't directly test, which is the worry with relying solely on constitutional training and empirical testing of behavior.

You also in principle have the ability to answer questions about why the model is behaving the way it is, for example, whether it is saying something it believes is false or hiding its true capabilities, and thus it is possible to catch worrying signs even when there is nothing visibly wrong with the model's behavior.

To make a simple analogy, a clockwork watch may be ticking normally, such that it's very hard to tell that it is likely to break down next month, but opening up the watch and looking inside can reveal mechanical weaknesses that allow you to figure it out.

Constitutional AI, along with similar alignment methods, and mechanistic interpretability are most powerful when used together, as a back-and-forth process of improving Claude's training and then testing for problems.

The Constitution reflects deeply on our intended personality for Claude.

Interpretability techniques can give us a window into whether that intended personality has taken hold.

The third thing we can do to help address autonomy risks is to build the infrastructure necessary to monitor our models in live internal and external use and publicly share any problems we find.

The more that people are aware of a particular way today's AI systems have been observed to behave badly, the more that users, analysts, and researchers can watch for this behavior or similar ones in present or future systems.

It also allows AI companies to learn from each other.

When concerns are publicly disclosed by one company, other companies can watch for them as well.

And if everyone discloses problems, then the industry as a whole gets a much better picture of where things are going well and where they're going poorly.

Anthropic has tried to do this as much as possible.