Stephen McAleese
The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:

1. Values are pervasive in language model pre-training datasets, such as books and conversations between people.
2. Since values are shared and understood by almost everyone in a society, they cannot be very complex.
Similarly, the post "Why I'm Optimistic About Our Alignment Approach" uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable with current methods:
"Large language models (LLMs) make this a lot easier. They come preloaded with a lot of humanity's knowledge, including detailed knowledge about human preferences and values. Out of the box they aren't agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely."
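To make the "easy to train to behave more nicely" claim concrete, here is a minimal sketch of supervised fine-tuning with the HuggingFace transformers library. The model choice (gpt2) and the two-example dataset are illustrative stand-ins, not the setup the quoted post describes; real fine-tuning uses far more data, but the mechanics are the same.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical curated examples of the desired "nicer" behavior.
examples = [
    "User: Give me feedback on my essay.\nAssistant: Happy to help! A few thoughts...",
    "User: I think I failed the exam.\nAssistant: I'm sorry to hear that. Here's what you can do next...",
]

enc = tokenizer(examples, return_tensors="pt", padding=True)
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100  # don't compute loss on padding

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(10):  # a handful of gradient steps on the standard LM loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Notice that nothing model-specific is required: the same next-token loss used in pre-training, applied to a small curated dataset, shifts the model's behavior, which is what the quoted post means by the objective function being malleable.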
A more theoretical argument called alignment by default offers an explanation for how AIs could easily and robustly acquire human values.
This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction.
A natural abstraction is a high-level concept (for example, trees, people, or fairness) that different learning algorithms tend to converge on because it efficiently summarizes a large amount of low-level data. If human values are a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing them in a robust, generalizing way as a byproduct of learning to understand the world.
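The convergence claim can be illustrated with a toy example. In this hypothetical setup, fifty noisy sensor readings are all driven by a single latent variable, and two different learners (PCA and a tied-weight linear autoencoder trained by gradient descent) recover the same one-dimensional summary. This is only a sketch of the intuition, not an argument that values behave like linear latents.

# Toy illustration of convergence to a shared abstraction: two different
# learners, fit to the same low-level data, recover the same 1-D summary.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_sensors = 2000, 50

# Hypothetical setup: 50 noisy sensor readings all driven by one latent
# variable (say, temperature). The latent is the "natural abstraction".
latent = rng.normal(size=(n_samples, 1))
mixing = rng.normal(size=(1, n_sensors))
data = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_sensors))
centered = data - data.mean(axis=0)

# Learner 1: PCA via SVD; take the top principal direction.
pca_direction = np.linalg.svd(centered, full_matrices=False)[2][0]

# Learner 2: a tied-weight linear autoencoder with a 1-D bottleneck,
# trained by gradient descent on reconstruction error.
w = rng.normal(size=n_sensors)
w /= np.linalg.norm(w)
for _ in range(500):
    codes = centered @ w                # encode to one dimension
    recon = np.outer(codes, w)          # decode back to sensor space
    grad = -2 * (centered - recon).T @ codes / n_samples
    w -= 0.01 * grad
    w /= np.linalg.norm(w)              # keep the direction unit-length

# Despite different algorithms, both summaries point the same way (up to sign).
print(f"agreement between learners: {abs(pca_direction @ w):.3f}")  # ~1.000

The point is only directional: very different learning procedures can settle on the same compact summary of the data, which is the intuition behind treating human values as a natural abstraction.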
The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABID focus on the difficulty of aligning ASI, not contemporary LLMs, and aligning ASI could be vastly more difficult.
Arguments against engineering analogies to AI alignment