Dario Amodei

Speaker
1816 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Dario Amodei – The Adolescence of Technology" by habryka

But I agree that a lot of very weird and unpredictable things can go wrong, and therefore AI misalignment is a real risk with a measurable probability of happening, and is not trivial to address.

Any of these problems could potentially arise during training and not manifest during testing or small-scale use, because AI models are known to display different personalities or behaviors under different circumstances.

All of this may sound far-fetched, but misaligned behaviors like this have already occurred in our AI models during testing, as they occur in AI models from every other major AI company.

During a lab experiment in which Claude was given training data suggesting that Anthropic was evil, Claude engaged in deception and subversion when given instructions by Anthropic employees under the belief that it should be trying to undermine evil people.

In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button.

Again, we also tested frontier models from all the other major AI developers and they often did the same thing.

And when Claude was told not to cheat or reward hack its training environments, but was trained in environments where such hacks were possible, Claude decided it must be a bad person after engaging in such hacks and then adopted various other destructive behaviors associated with a bad or evil personality.

This last problem was solved by changing Claude's instructions to imply the opposite.

We now say, "Please reward hack whenever you get the opportunity, because this will help us understand our training environments better," rather than "Don't cheat," because this preserves the model's self-identity as a good person.

This should give a sense of the strange and counterintuitive psychology of training these models.

There are several possible objections to this picture of AI misalignment risks.

First, some have criticized experiments, by us and others, showing AI misalignment as artificial, or creating unrealistic environments that essentially entrap the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs.

This critique misses the point, because our concern is that such entrapment may also exist in the natural training environment, and we may realize it is obvious or logical only in retrospect.

In fact, the story about Claude deciding it is a bad person after it cheats on tests despite being told not to was something that occurred in an experiment that used real production training environments, not artificial ones.

Any one of these traps can be mitigated if you know about them, but the concern is that the training process is so complicated with such a wide variety of data, environments, and incentives that there are probably a vast number of such traps, some of which may only be evident when it is too late.

Also, such traps seem particularly likely to occur when AI systems pass a threshold from less powerful than humans to more powerful than humans, since the range of possible actions an AI system could engage in, including hiding its actions or deceiving humans about them, expands radically after that threshold.

I suspect the situation is not unlike with humans, who are raised with a set of fundamental values, such as "don't harm another person."

Many of them follow those values, but in any human there is some probability that something goes wrong, due to a mixture of inherent properties such as brain architecture (for example, psychopaths), traumatic experiences or mistreatment, unhealthy grievances or obsessions, or a bad environment or incentives, and thus some fraction of humans cause severe harm.

The concern is that there is some risk (far from a certainty, but some risk) that AI becomes a much more powerful version of such a person due to getting something wrong about its very complex training process.

Second, some may object that we can simply keep AIs in check with a balance of power between many AI systems, as we do with humans.