Stephen McAleese
Inner misalignment evolution analogy.
The authors use an evolution analogy to explain the inner alignment problem in an intuitive way.
In their story there are two aliens that are trying to predict the preferences of humans after they have evolved.
One alien argues that since evolution optimizes the genome of organisms for maximizing inclusive genetic fitness, that is, survival and reproduction, humans will come to care only about that too: they will only eat foods that are high in calories or nutrients, and only have sex when it leads to offspring.
The other alien, who turns out to be correct, predicts that humans will develop a variety of drives that are correlated with inclusive genetic fitness, IGF, like enjoying tasty food and caring for loved ones, but that they will value these drives for their own sake rather than valuing IGF itself, even once they understand what IGF is.
This alien is correct because once humans did come to understand IGF, we continued to do things like eating sucralose, which is sweet but has no calories, and having sex with contraception, which is enjoyable but produces no offspring.
Outer objective.
In this analogy, maximizing inclusive genetic fitness, IGF, is the base or outer objective of natural selection optimizing the human genome.
Inner objective.
The goals that humans actually have, such as enjoying sweet foods or sex, are the inner or mesa objective.
The outer optimizer selects these proxy objectives because they are among the many possible objectives that score highly on the outer objective in the training distribution, even though they may diverge from it in other environments.
Inner misalignment.
In this analogy, humans are inner misaligned because their true goals, the inner objective, are different from the goals of natural selection, the outer objective.
In a different environment, for example the modern world, humans can score highly according to the inner objective, for example by having sex with contraception, but low according to the outer objective, IGF, for example by not having kids.
Real examples of inner misalignment.
Are there real-world examples of inner alignment failures?
Yes, though unfortunately the book doesn't seem to mention them to support its argument.
In 2022, researchers studied an environment in the game CoinRun that rewarded an AI for collecting a coin, but during training the coin was always placed at the end of the level, so the AI learned to run to the end of the level to get the coin.
When the researchers then placed the coin at a random position in the level, the AI still ran to the end of the level and rarely collected the coin.
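The failure mode can be illustrated with a toy sketch (this is not the actual CoinRun code; the gridworld, the "always move right" policy, and all names here are hypothetical stand-ins for the learned behavior). In training the coin always sits at the far right of the agent's row, so a policy that just moves right earns full reward; when the coin is placed at random, the same policy only finds coins that happen to lie on its path.

```python
# Toy sketch of goal misgeneralization in the style of the CoinRun
# result. A 10x3 gridworld; the agent starts at (0, 0) and the
# learned proxy policy always moves right along its own row.
import random

WIDTH, HEIGHT = 10, 3

def run_episode(policy, coin):
    """Return 1 if the agent reaches the coin's cell, else 0."""
    pos = (0, 0)
    for _ in range(WIDTH):
        if pos == coin:
            return 1  # coin collected
        pos = policy(pos)
    return 1 if pos == coin else 0

def always_right(pos):
    # The proxy behavior the agent learned: head for the end of
    # the level, ignoring where the coin actually is.
    x, y = pos
    return (min(x + 1, WIDTH - 1), y)

# In distribution: the coin is always at the end of the agent's row,
# so the proxy policy scores perfectly on the outer objective.
print(run_episode(always_right, (WIDTH - 1, 0)))  # 1

# Out of distribution: the coin is placed uniformly at random, and the
# agent only collects coins that happen to lie on its row.
random.seed(0)
hits = sum(
    run_episode(always_right,
                (random.randrange(WIDTH), random.randrange(HEIGHT)))
    for _ in range(1000)
)
print(hits / 1000)  # roughly 1/HEIGHT of random placements succeed
```

The point of the sketch is that both policies, "go to the coin" and "go to the end of the level", are indistinguishable on the training distribution, so the outer optimizer has no way to prefer one over the other until the environment changes.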
Outer objective.
In this example, going to the coin is the outer objective the AI is rewarded for.