Dario Amodei

"Dario Amodei – The Adolescence of Technology" by habryka

The problem is that while humans vary enormously, AI systems broadly share training and alignment techniques across the industry, and those techniques may fail in a correlated way.

1642.846

LessWrong (Curated & Popular)

Furthermore, given the cost of training such systems, it may even be the case that all systems are essentially derived from a very small number of base models.

1653.325

Additionally, even if a small fraction of AI instances are misaligned, they may be able to take advantage of offense-dominant technologies, such that having good AIs to defend against the bad AIs is not necessarily always effective.

1662.32

Of course, the balance of power between humans does not always work either.

1675.735

Some historical figures have come close to taking over the world.

1680.039

A third objection is that all of the AI companies do pre-release testing of their models and should be able to detect misalignment at that point.

1684.084

But this is not firm ground to stand on.

1692.138

We found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations.

1694.989

It's possible that a misaligned model (and remember, all frontier models will very likely be far more intelligent soon) might intentionally game such questions to mask its intentions.

1702.583

In fact, last year our interpretability team found that when we directly altered a test model's beliefs using a kind of model neuroscience technique to make it think it was not being evaluated, it became more misaligned.

1713.362

If models know when they're being evaluated and can be on their best behavior during the test, it renders any pre-release testing much more uncertain.

1724.943

Subheading: Defenses

1733.987

What should be done or is being done to address these autonomy risks?

1737.025

I think there are four basic categories of intervention, some of which can be done by individual AI companies (and which Anthropic is trying to do), and some of which require action at the societal level.

1741.452

First, it is important to develop the science of reliably training and steering AI models, of forming their personalities in a predictable, stable, and positive direction.

1753.773

Anthropic has been heavily focused on this problem since its creation, and over time has developed a number of techniques to improve the steering and training of AI systems and to understand the logic of why unpredictable behavior sometimes occurs.

1763.941

One of our core innovations, aspects of which have since been adopted by other AI companies, is Constitutional AI, which is the idea that AI training (specifically the post-training stage, in which we steer how the model behaves)

1777.157

can involve a central document of values and principles that the model reads and keeps in mind when completing every training task, and that the goal of training (in addition to simply making the model capable and intelligent) is to produce a model that almost always follows this constitution.

1789.692

Anthropic has just published its most recent constitution, and one of its notable features is that instead of giving Claude a long list of things to do and not do (for example, "don't help the user hotwire a car"), the constitution attempts to give Claude a set of high-level principles and values, explained in great detail, with rich reasoning and examples to help Claude understand what we have in mind. It encourages Claude to think of itself as a particular type of person, an ethical but balanced and

1806.382

thoughtful person, and even encourages Claude to confront the existential questions associated with its own existence in a curious but graceful manner, that is, without it leading to extreme actions.

1833.23