Dario Amodei

"Dario Amodei – The Adolescence of Technology" by habryka

By looking inside, I mean analyzing the soup of numbers and operations that makes up Claude's neural net and trying to understand, mechanistically, what they are computing and why.

LessWrong (Curated & Popular)

Recall that these AI models are grown rather than built, so we don't have a natural understanding of how they work, but we can try to develop an understanding by correlating the model's neurons and synapses to stimuli and behavior, or even altering the neurons and synapses and seeing how that changes behavior, similar to how neuroscientists study animal brains by correlating measurement and intervention to external stimuli and behavior.

We've made a great deal of progress in this direction, and can now identify tens of millions of features inside Claude's neural net that correspond to human-understandable ideas and concepts, and we can also selectively activate features in a way that alters behavior.

More recently, we have gone beyond individual features to mapping circuits that orchestrate complex behavior like rhyming, reasoning about theory of mind, or the step-by-step reasoning needed to answer questions such as "What is the capital of the state containing Dallas?"

Even more recently, we've begun to use mechanistic interpretability techniques to improve our safeguards and to conduct audits of new models before we release them, looking for evidence of deception, scheming, power-seeking, or a propensity to behave differently when being evaluated.

The unique value of interpretability is that by looking inside the model and seeing how it works, you in principle have the ability to deduce what a model might do in a hypothetical situation you can't directly test, which is the worry with relying solely on constitutional training and empirical testing of behavior.

You also, in principle, have the ability to answer questions about why the model is behaving the way it is (for example, whether it is saying something it believes is false or hiding its true capabilities), and thus it is possible to catch worrying signs even when there is nothing visibly wrong with the model's behavior.

To make a simple analogy, a clockwork watch may be ticking normally, such that it's very hard to tell that it is likely to break down next month, but opening up the watch and looking inside can reveal mechanical weaknesses that allow you to figure it out.

Constitutional AI, along with similar alignment methods, and mechanistic interpretability are most powerful when used together, as a back-and-forth process of improving Claude's training and then testing for problems.

The Constitution reflects deeply on our intended personality for Claude.

Interpretability techniques can give us a window into whether that intended personality has taken hold.

The third thing we can do to help address autonomy risks is to build the infrastructure necessary to monitor our models in live internal and external use and publicly share any problems we find.

The more that people are aware of a particular way today's AI systems have been observed to behave badly, the more that users, analysts, and researchers can watch for this behavior or similar ones in present or future systems.

It also allows AI companies to learn from each other.

When concerns are publicly disclosed by one company, other companies can watch for them as well.

And if everyone discloses problems, then the industry as a whole gets a much better picture of where things are going well and where they're going poorly.

Anthropic has tried to do this as much as possible.
