Stephen McAleese
However, in the limited training environment, go to the coin and go to the end of the level were two goals that performed identically.
The outer optimizer happened to select the go to the end of the level goal which worked well in the training distribution but not in a more diverse test distribution.
As a result, in the test distribution the AI still went to the end of the level, even though the coin was now randomly placed.
This is an example of inner misalignment because the inner objective, go to the end of the level, is different from the intended outer objective, go to the coin.
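A toy one-dimensional version of this makes the divergence concrete. This is a hedged sketch, not the actual CoinRun environment or training setup: the level layout, the two policies, and the success criterion (ending the episode on the coin) are all illustrative assumptions.

```python
import random

LEVEL_LENGTH = 10

def final_position(coin_pos, policy):
    """Roll a policy out for a fixed number of steps; return where it ends up."""
    agent = 0
    for _ in range(LEVEL_LENGTH):
        agent += policy(agent, coin_pos)
        agent = max(0, min(LEVEL_LENGTH - 1, agent))  # stay inside the level
    return agent

# Inner objective the outer optimizer happened to select: go to the end.
def go_to_end(agent, coin_pos):
    return 1  # always move right

# The intended objective: move toward the coin and stay on it.
def go_to_coin(agent, coin_pos):
    if agent < coin_pos:
        return 1
    if agent > coin_pos:
        return -1
    return 0

# Training distribution: the coin is always at the end of the level,
# so both policies collect the coin and look identical.
train_coins = [LEVEL_LENGTH - 1] * 100
# Test distribution: the coin is placed randomly.
test_coins = [random.randrange(LEVEL_LENGTH) for _ in range(100)]

for name, policy in [("go_to_end", go_to_end), ("go_to_coin", go_to_coin)]:
    train_rate = sum(final_position(c, policy) == c for c in train_coins) / len(train_coins)
    test_rate = sum(final_position(c, policy) == c for c in test_coins) / len(test_coins)
    print(f"{name}: train success {train_rate:.2f}, test success {test_rate:.2f}")
```

Both policies score perfectly during training; only the randomized test distribution reveals that go_to_end was never pursuing the coin at all.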
Inner misalignment explanation.
The next question is what causes inner misalignment to occur.
If we train an AI with an outer objective, why does the AI often end up with a different, misaligned inner objective instead of internalizing the intended outer objective?
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems.
Unidentifiability.
The training data often does not contain enough information to uniquely identify the intended outer objective.
If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them.
As a result, optimization may converge to an inner objective that is a misaligned proxy rather than the intended goal.
For example, in a coin run style training environment where the coin always appears at the end of the level, objectives such as go to the coin, go to the end of the level, go to a yellow thing, or go to a round thing all perform equally well according to the outer objective.
Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
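The indistinguishability claim can be stated in a few lines of code. A minimal sketch, where the State fields and the two objective functions are illustrative assumptions (objectives like go to a yellow thing or go to a round thing would behave identically on this data):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    agent_pos: int
    coin_pos: int
    end_pos: int

# Two candidate inner objectives, each scoring a terminal state.
def reached_coin(s: State) -> bool:
    return s.agent_pos == s.coin_pos

def reached_end(s: State) -> bool:
    return s.agent_pos == s.end_pos

# Training states: the coin always sits at the end of the level, so the
# data carries no signal that could distinguish the two objectives.
train_states = [State(agent_pos=p, coin_pos=9, end_pos=9) for p in range(10)]
assert all(reached_coin(s) == reached_end(s) for s in train_states)

# Off-distribution, the objectives come apart.
test_state = State(agent_pos=9, coin_pos=3, end_pos=9)
print(reached_coin(test_state), reached_end(test_state))  # False True
```

Because the two functions agree on every training state, no amount of training signal from this environment can tell the outer optimizer which one the agent has internalized.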
Simplicity bias.
When the correct outer objective is more complex than a proxy that fits the training data equally well, and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment.
For example, evolution gave humans simple proxy goals, such as avoiding pain and hunger, rather than the more complex true outer objective of maximizing inclusive genetic fitness.
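A selection rule with a simplicity prior can be sketched directly. Everything here is an illustrative assumption: measuring an objective's complexity by its compiled bytecode size is a crude stand-in for description length, and the two candidate objectives reuse the coin example rather than the evolution one.

```python
def fits_training_data(objective, data):
    """An objective is a candidate only if it labels every training case correctly."""
    return all(objective(obs) == label for obs, label in data)

def complexity(objective):
    # Crude stand-in for description length: size of the compiled bytecode.
    return len(objective.__code__.co_code)

# Training observations: because the coin is always at the end of the level,
# the at_coin and at_end features never disagree during training.
train_data = [
    ({"at_coin": True, "at_end": True}, True),
    ({"at_coin": False, "at_end": False}, False),
]

# The simple proxy objective.
def go_to_end(obs):
    return obs["at_end"]

# The intended objective: written out, it takes more description length,
# and its extra structure is invisible on the training distribution.
def go_to_coin(obs):
    if obs["at_coin"]:
        return True
    else:
        return False

# Among objectives consistent with the training data, a simplicity bias
# selects the shortest one -- here, the proxy.
consistent = [f for f in (go_to_end, go_to_coin) if fits_training_data(f, train_data)]
print(min(consistent, key=complexity).__name__)  # go_to_end
```

Both objectives fit the training data perfectly, so the only thing breaking the tie is the simplicity prior, and it breaks the tie in favor of the misaligned proxy.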
Can't we just train away inner misalignment?