Stephen McAleese
However, in the limited training environment, go to the coin and go to the end of the level were two goals that performed identically.
The outer optimizer happened to select the go to the end of the level goal which worked well in the training distribution but not in a more diverse test distribution.
As a result, in the test distribution the AI still went to the end of the level, even though the coin was now randomly placed.
This is an example of inner misalignment because the inner objective, go to the end of the level, is different from the intended outer objective, go to the coin.
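A toy one-dimensional version of this makes the divergence concrete. This is a hedged sketch, not the actual CoinRun environment or training setup: the level layout, the two policies, and the success criterion (ending the episode on the coin) are all illustrative assumptions.

```python
import random

LEVEL_LENGTH = 10

def final_position(coin_pos, policy):
    """Roll a policy out for a fixed number of steps; return where it ends up."""
    agent = 0
    for _ in range(LEVEL_LENGTH):
        agent += policy(agent, coin_pos)
        agent = max(0, min(LEVEL_LENGTH - 1, agent))  # stay inside the level
    return agent

# Inner objective the outer optimizer happened to select: go to the end.
def go_to_end(agent, coin_pos):
    return 1  # always move right

# The intended objective: move toward the coin and stay on it.
def go_to_coin(agent, coin_pos):
    if agent < coin_pos:
        return 1
    if agent > coin_pos:
        return -1
    return 0

# Training distribution: the coin is always at the end of the level,
# so both policies collect the coin and look identical.
train_coins = [LEVEL_LENGTH - 1] * 100
# Test distribution: the coin is placed randomly.
test_coins = [random.randrange(LEVEL_LENGTH) for _ in range(100)]

for name, policy in [("go_to_end", go_to_end), ("go_to_coin", go_to_coin)]:
    train_rate = sum(final_position(c, policy) == c for c in train_coins) / len(train_coins)
    test_rate = sum(final_position(c, policy) == c for c in test_coins) / len(test_coins)
    print(f"{name}: train success {train_rate:.2f}, test success {test_rate:.2f}")
```

Both policies score perfectly during training; only the randomized test distribution reveals that go_to_end was never pursuing the coin at all.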
Inner misalignment explanation.
The next question is what causes inner misalignment to occur.
If we train an AI with an outer objective, why does the AI often end up with a different, misaligned inner objective instead of internalizing the intended outer objective?
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems.
Unidentifiability.
The training data often does not contain enough information to uniquely identify the intended outer objective.
If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them.
As a result, optimization may converge to an inner objective that is a misaligned proxy rather than the intended goal.
For example, in a coin run style training environment where the coin always appears at the end of the level, objectives such as go to the coin, go to the end of the level, go to a yellow thing, or go to a round thing all perform equally well according to the outer objective.
Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
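The indistinguishability claim can be stated in a few lines of code. A minimal sketch, where the State fields and the two objective functions are illustrative assumptions (objectives like go to a yellow thing or go to a round thing would behave identically on this data):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    agent_pos: int
    coin_pos: int
    end_pos: int

# Two candidate inner objectives, each scoring a terminal state.
def reached_coin(s: State) -> bool:
    return s.agent_pos == s.coin_pos

def reached_end(s: State) -> bool:
    return s.agent_pos == s.end_pos

# Training states: the coin always sits at the end of the level, so the
# data carries no signal that could distinguish the two objectives.
train_states = [State(agent_pos=p, coin_pos=9, end_pos=9) for p in range(10)]
assert all(reached_coin(s) == reached_end(s) for s in train_states)

# Off-distribution, the objectives come apart.
test_state = State(agent_pos=9, coin_pos=3, end_pos=9)
print(reached_coin(test_state), reached_end(test_state))  # False True
```

Because the two functions agree on every training state, no amount of training signal from this environment can tell the outer optimizer which one the agent has internalized.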
Simplicity bias.
When the correct outer objective is more complex than a proxy that fits the training data equally well, and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment.
For example, evolution gave humans simple proxy goals, such as avoiding pain and hunger, rather than the more complex true outer objective of maximizing inclusive genetic fitness.
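A selection rule with a simplicity prior can be sketched directly. Everything here is an illustrative assumption: measuring an objective's complexity by its compiled bytecode size is a crude stand-in for description length, and the two candidate objectives reuse the coin example rather than the evolution one.

```python
def fits_training_data(objective, data):
    """An objective is a candidate only if it labels every training case correctly."""
    return all(objective(obs) == label for obs, label in data)

def complexity(objective):
    # Crude stand-in for description length: size of the compiled bytecode.
    return len(objective.__code__.co_code)

# Training observations: because the coin is always at the end of the level,
# the at_coin and at_end features never disagree during training.
train_data = [
    ({"at_coin": True, "at_end": True}, True),
    ({"at_coin": False, "at_end": False}, False),
]

# The simple proxy objective.
def go_to_end(obs):
    return obs["at_end"]

# The intended objective: written out, it takes more description length,
# and its extra structure is invisible on the training distribution.
def go_to_coin(obs):
    if obs["at_coin"]:
        return True
    else:
        return False

# Among objectives consistent with the training data, a simplicity bias
# selects the shortest one -- here, the proxy.
consistent = [f for f in (go_to_end, go_to_coin) if fits_training_data(f, train_data)]
print(min(consistent, key=complexity).__name__)  # go_to_end
```

Both objectives fit the training data perfectly, so the only thing breaking the tie is the simplicity prior, and it breaks the tie in favor of the misaligned proxy.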
Can't we just train away inner misalignment?