Dario Amodei

Speaker
1816 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Dario Amodei – The Adolescence of Technology" by habryka

The problem is that while humans vary enormously, AI systems broadly share training and alignment techniques across the industry, and those techniques may fail in a correlated way.

Furthermore, given the cost of training such systems, it may even be the case that all systems are essentially derived from a very small number of base models.

Additionally, even if a small fraction of AI instances are misaligned, they may be able to take advantage of offense-dominant technologies, such that having good AIs to defend against the bad AIs is not necessarily always effective.

Of course, the balance of power between humans does not always work either.

Some historical figures have come close to taking over the world.

A third objection is that all of the AI companies do pre-release testing of their models and should be able to detect misalignment at that point.

But this is not firm ground to stand on.

We found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations.

It's possible that a misaligned model, and remember, all frontier models will very likely be far more intelligent soon, might intentionally game such questions to mask its intentions.

In fact, last year our interpretability team found that when we directly altered a test model's beliefs using a kind of model neuroscience technique to make it think it was not being evaluated, it became more misaligned.

If models know when they're being evaluated and can be on their best behavior during the test, it renders any pre-release testing much more uncertain.

Subheading: Defenses

What should be done or is being done to address these autonomy risks?

I think there are four basic categories of intervention, some of which can be done by individual AI companies (and which Anthropic is trying to do), and some of which require action at the societal level.

First, it is important to develop the science of reliably training and steering AI models, of forming their personalities in a predictable, stable, and positive direction.

Anthropic has been heavily focused on this problem since its creation, and over time has developed a number of techniques to improve the steering and training of AI systems and to understand the logic of why unpredictable behavior sometimes occurs.

One of our core innovations, aspects of which have since been adopted by other AI companies, is constitutional AI, which is the idea that AI training, specifically the post-training stage in which we steer how the model behaves, can involve a central document of values and principles that the model reads and keeps in mind when completing every training task, and that the goal of training, in addition to simply making the model capable and intelligent, is to produce a model that almost always follows this constitution.

Anthropic has just published its most recent constitution, and one of its notable features is that instead of giving Claude a long list of things to do and not do (for example, "don't help the user hotwire a car"), the constitution attempts to give Claude a set of high-level principles and values, explained in great detail, with rich reasoning and examples to help Claude understand what we have in mind. It encourages Claude to think of itself as a particular type of person, an ethical but balanced and thoughtful person, and even encourages Claude to confront the existential questions associated with its own existence in a curious but graceful manner, that is, without it leading to extreme actions.