Stephen McAleese


Podcast Appearances

LessWrong (Curated & Popular)
"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

In the limited training environment, however, "go to the coin" and "go to the end of the level" were two goals that performed identically.

The outer optimizer happened to select the "go to the end of the level" goal, which worked well in the training distribution but not in a more diverse test distribution.

In the test distribution, the AI still went to the end of the level even though the coin was placed randomly.

This is an example of inner misalignment: the inner objective, "go to the end of the level", is different from the intended outer objective, "go to the coin".
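The situation above can be sketched in a few lines of Python. This is a toy one-dimensional stand-in for CoinRun, not the real environment: the two policy functions and the reward function are hypothetical names chosen for illustration.

```python
# Toy sketch of inner misalignment (not the actual CoinRun environment):
# a 1-D level where the coin sits at the end during training, so the
# policies "go to the coin" and "go to the end" behave identically.

LEVEL_LENGTH = 10

def go_to_coin(coin_pos):
    return coin_pos          # agent walks to wherever the coin is

def go_to_end(coin_pos):
    return LEVEL_LENGTH - 1  # agent always walks to the last tile

def reward(final_pos, coin_pos):
    # Outer objective: 1 if the agent reaches the coin, else 0.
    return 1 if final_pos == coin_pos else 0

# Training distribution: coin always at the end of the level.
train_coin = LEVEL_LENGTH - 1
print(reward(go_to_coin(train_coin), train_coin))  # 1
print(reward(go_to_end(train_coin), train_coin))   # 1 -- indistinguishable

# Test distribution: coin placed elsewhere; the proxy policy fails.
test_coin = 3
print(reward(go_to_coin(test_coin), test_coin))  # 1
print(reward(go_to_end(test_coin), test_coin))   # 0 -- misalignment revealed
```

During training the two policies earn identical reward, so the outer optimizer cannot tell them apart; only the shifted test distribution exposes the difference.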

Inner misalignment explanation.

The next question is what causes inner misalignment to occur.

If we train an AI with an outer objective, why does the AI often end up with a different, misaligned inner objective instead of internalizing the intended outer objective?

Here are some reasons why an outer optimizer may produce an AI with a misaligned inner objective, according to the paper "Risks from Learned Optimization in Advanced Machine Learning Systems".

Unidentifiability

The training data often does not contain enough information to uniquely identify the intended outer objective.

If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them.

As a result, optimization may converge to an internal objective that is a misaligned proxy rather than the intended goal.

For example, in a CoinRun-style training environment where the coin always appears at the end of the level, objectives such as "go to the coin", "go to the end of the level", "go to a yellow thing", or "go to a round thing" all perform equally well according to the outer objective.

Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
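Unidentifiability can be made concrete with a toy score tally. The episode feature names below (`reached_coin`, `reached_end`, and so on) are hypothetical labels invented for this sketch; the point is only that every candidate objective assigns the same score to every training episode.

```python
# Toy illustration of unidentifiability: several candidate inner objectives
# score identically on training episodes where the coin is always at the end,
# so the training signal cannot distinguish between them.

# In every training episode, the object the agent reaches is the coin,
# which is at the end of the level, yellow, and round.
train_episodes = [
    {"reached_coin": True, "reached_end": True,
     "reached_yellow": True, "reached_round": True}
    for _ in range(100)
]

candidate_objectives = {
    "go to the coin":       lambda ep: ep["reached_coin"],
    "go to the end":        lambda ep: ep["reached_end"],
    "go to a yellow thing": lambda ep: ep["reached_yellow"],
    "go to a round thing":  lambda ep: ep["reached_round"],
}

scores = {name: sum(obj(ep) for ep in train_episodes)
          for name, obj in candidate_objectives.items()}
print(scores)  # every candidate scores 100: the data cannot identify the goal
```

Because all four candidates fit the training data perfectly, which one the optimizer ends up selecting is determined by its inductive biases rather than by the reward signal.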

Simplicity bias

If the correct outer objective is more complex than a proxy that fits the training data equally well, an outer optimizer with an inductive bias towards simple objectives may favor the simpler proxy, increasing the risk of inner misalignment.

For example, evolution gave humans simple proxy goals, such as avoiding pain and hunger, rather than the more complex true outer objective of maximizing inclusive genetic fitness.
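A minimal way to model this tie-breaking is a score that subtracts a complexity penalty from training fit. The numbers and the description-length-in-words complexity measure below are made up for illustration; they are a crude stand-in for the real inductive biases of SGD and neural network parameterization.

```python
# Toy model of simplicity bias: two objectives fit the training data
# equally well, but an optimizer with a complexity penalty picks the
# simpler one. "complexity" here is just description length in words.

candidates = {
    "avoid pain and hunger":               {"train_fit": 1.0, "complexity": 4},
    "maximize inclusive genetic fitness":  {"train_fit": 1.0, "complexity": 10},
}

def score(candidate, penalty=0.01):
    # Equal training fit, so the complexity penalty breaks the tie.
    return candidate["train_fit"] - penalty * candidate["complexity"]

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # "avoid pain and hunger" -- the simpler proxy wins the tie
```

On the training data both objectives are indistinguishable (`train_fit` is identical), so the simplicity penalty alone decides which one the optimizer installs as the inner objective.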

Can't we just train away inner misalignment?