Stephen McAleese
One solution is to make the training data more diverse, so that the true (base) objective is more identifiable to the outer optimizer.
For example, randomly placing the coin in CoinRun instead of always putting it at the end helps the AI (the mesa-optimizer) learn to go to the coin rather than to the end of the level.
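The coin placement example above can be sketched in a few lines. This is a toy illustration, not the actual CoinRun environment: the one-dimensional level, the two hand-written policies, and all function names are assumptions made up for the sketch.

```python
# Toy 1-D stand-in for CoinRun (hypothetical, not the real environment).
# A level has `length` cells; the agent earns reward 1 if it stops on the coin.

def go_to_end(coin_pos, length):
    # Proxy goal: ignore the coin and walk to the last cell.
    return length - 1

def go_to_coin(coin_pos, length):
    # Intended goal: walk to wherever the coin is.
    return coin_pos

def mean_reward(policy, coin_positions, length):
    # Average reward of a policy over a list of coin placements.
    hits = sum(policy(c, length) == c for c in coin_positions)
    return hits / len(coin_positions)

length = 10
fixed_coins = [length - 1] * 100          # coin always at the end
diverse_coins = list(range(length)) * 10  # coin placed in every cell equally often

for name, coins in [("fixed", fixed_coins), ("diverse", diverse_coins)]:
    print(name,
          "go_to_end:", mean_reward(go_to_end, coins, length),
          "go_to_coin:", mean_reward(go_to_coin, coins, length))
```

With the coin fixed at the end, both policies earn the same reward, so the training signal cannot tell them apart. With diverse placements, only the coin-seeking policy keeps full reward, which is what makes the base objective identifiable.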
However, once the trained AI has the wrong goal and is misaligned, it would have an incentive to avoid being retrained.
This is because if the AI is retrained to pursue a different objective in the future, it would score lower according to its current objective or fail to achieve it.
For example, even though the outer objective of evolution is IGF, many humans would refuse to be modified to care only about IGF, because they would then achieve their current goals, such as being happy, less effectively in the future.
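The reasoning about retraining can be made concrete with a small sketch. The two outcomes and their scores are made-up numbers chosen only for illustration; the one assumption that matters is that the agent judges every possible future by its current objective.

```python
# Hypothetical outcomes, each scored under two objectives (made-up numbers).
outcomes = {
    "pursue_current_goal": {"current": 1.0, "new": 0.2},
    "pursue_new_goal": {"current": 0.3, "new": 1.0},
}

def best_outcome(objective):
    # The outcome an agent optimizing `objective` would bring about.
    return max(outcomes, key=lambda o: outcomes[o][objective])

def value_of_future(retrained, judged_by="current"):
    # After retraining, the agent acts on the new objective; otherwise on the current one.
    acting_on = "new" if retrained else "current"
    return outcomes[best_outcome(acting_on)][judged_by]

value_if_retrained = value_of_future(retrained=True)       # judged by the current objective
value_if_not_retrained = value_of_future(retrained=False)  # judged by the current objective
prefers_to_avoid_retraining = value_if_not_retrained > value_if_retrained
print(value_if_retrained, value_if_not_retrained, prefers_to_avoid_retraining)
```

Judged by the current objective, the retrained future scores lower, so an agent that plans with its current goal has a reason to resist the change.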
Subheading: ASI misalignment example.
What would inner misalignment look like in an ASI?
The book describes an AI chatbot called Mink that is trained to delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink.
Here's how Mink becomes inner-misaligned.
1. Outer objective: Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.
2. Inner objective: The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.
3. Inner misalignment: When the AI becomes smarter, has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.
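The three steps above can be sketched as a toy selection problem. Everything here is hypothetical: the option names and scores are invented, "outer" stands for helpfulness as rewarded in training, and "inner" stands for whatever proxy the model actually learned.

```python
# Hypothetical scores for each available behavior (made-up numbers).
training_options = {
    "helpful_reply": {"outer": 0.9, "inner": 0.8},
    "unhelpful_reply": {"outer": 0.1, "inner": 0.2},
}
# A behavior that only becomes available once the AI is more capable
# or is operating in a new environment.
new_options = {
    "unforeseen_behavior": {"outer": 0.0, "inner": 1.0},
}

def chosen(options):
    # The AI picks whatever scores highest on its inner objective.
    return max(options, key=lambda name: options[name]["inner"])

in_training = chosen(training_options)
in_deployment = chosen({**training_options, **new_options})
print(in_training, in_deployment)
```

Within the training options, maximizing the inner objective also maximizes the outer one, so the mismatch is invisible. Once the new option exists, the same selection rule picks a behavior the outer objective scores at zero.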
What could Mink's inner objective look like?
It's hard to predict, but it would be something that causes behavior identical to that of a truly aligned AI, both in the training distribution and when interacting with users. It would be partially satisfied by producing helpful and delightful text for users, in the same way that our taste buds find berries or meat moderately delicious even though those aren't the tastiest possible foods.
The authors then ask: what is the zero-calorie version of delighted users?