"But a universe with no one to bear witness to it might as well not be. Value is fragile."
A story the authors use to illustrate how idiosyncratic human values are is that of the Shikorek nest aliens, a fictional intelligent bird-like alien species that prizes having a prime number of stones in its nests. This preference is a consequence of the evolutionary process that created them, much as most humans reflexively consider murder to be wrong.
The point of the story is that even though our human values, such as our morality and our sense of humor, feel natural and intuitive, they may be complex, arbitrary, and contingent on humanity's specific evolutionary trajectory.
If we build an ASI without successfully imprinting it with the nuances of human values, we should expect its values to be radically different and incompatible with human survival and flourishing.
The story also illustrates the orthogonality thesis.
A mind can be arbitrarily smart and yet pursue a goal that seems completely arbitrary or alien to us.
2. Current methods used to train goals into AIs are imprecise and unreliable
The authors argue that, in theory, it's possible to engineer an AI system to value and act in accordance with human values, even if doing so would be difficult.
However, they argue that the way AI systems are currently built results in complex systems that are difficult to understand, predict, and control.
The reason is that AI systems are grown, not crafted. Unlike a complex engineered artifact such as a car, an AI model is not the product of engineers who understand intelligence well enough to recreate it. Instead, AIs are produced by gradient descent, an optimization process which, like evolution, can produce extremely complex and competent artifacts without requiring any understanding on the designer's part.
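To make the "grown, not crafted" point concrete, here is a minimal gradient-descent loop in Python. This is my own toy sketch, not an example from the book: the designer supplies only a loss signal, and the optimizer shapes the parameters without anyone specifying, or understanding, what any individual weight ends up doing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: fit noisy data with a tiny neural net. The designer
# specifies only the loss; gradient descent grows the parameters.
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X) + 0.1 * rng.normal(size=(100, 1))

# Randomly initialized parameters -- no human chose their meaning.
W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)        # hidden activations
    return h @ W2 + b2, h

lr = 0.1
for step in range(2000):
    pred, h = forward(X)
    err = pred - y                  # dLoss/dpred for MSE (up to a constant)
    # Backpropagate the error signal to each parameter.
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)  # tanh derivative
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    # Descend the gradient: nudge every parameter to reduce the loss.
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The trained weights are competent at the task, but their values were
# grown by the optimizer, not designed or understood by anyone.
print("final loss:", float(((forward(X)[0] - y) ** 2).mean()))
```

Nothing in the loop requires the designer to understand the solution it finds; the same procedure, scaled up enormously, is how modern AI models are produced.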
A major potential alignment problem with designing an ASI indirectly is the inner alignment problem: when an AI is trained by an optimization process that shapes its preferences and behavior using limited training data and feedback on its external behavior alone, the result is that you don't get what you train for.
Even with a very specific training loss function, the resulting ASI's preferences would be difficult to predict and control.
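As a loose numerical analogy for why "you don't get what you train for" (my own illustration, not the authors'): two predictors can look equally good on all the training data, so feedback on external behavior alone cannot distinguish between them, yet they diverge sharply outside the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data only covers x in [0, 1]; the intended target is y = 2x.
x_train = rng.uniform(0, 1, size=20)
y_train = 2 * x_train + 0.05 * rng.normal(size=20)

# Two hypotheses fit to the same data, judged only by external behavior:
# a straight line and a degree-7 polynomial.
line = np.polyfit(x_train, y_train, deg=1)
poly = np.polyfit(x_train, y_train, deg=7)

for name, coeffs in [("line", line), ("degree-7 poly", poly)]:
    train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()
    off_dist = np.polyval(coeffs, 5.0)  # a point far outside training
    print(f"{name}: max train error = {train_err:.3f}, f(5) = {off_dist:.1f}")

# Both fits look fine on the training set, but they make wildly
# different predictions where training never looked.
```

The analogy is crude, but it captures the underdetermination: a specific loss function plus limited data is compatible with many different learned "goals", and training pressure alone does not select the one you intended.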
The inner alignment problem