Stephen McAleese
One of the book's arguments for why ASI alignment would be difficult is that it is a high-stakes engineering challenge similar to other difficult historical engineering problems, such as successfully launching a space probe, building a safe nuclear reactor, or building a secure computer system.
In these fields, a single flaw often leads to total catastrophic failure.
However, one post criticizes the use of these analogies, arguing that modern AI and neural networks form a new field with no historical precedent, much as quantum mechanics is difficult to explain using intuitions from everyday physics.
The author illustrates several ways that ML systems defy intuitions derived from engineering fields like rocketry or computer science.
Model robustness
In a rocket, swapping a fuel tank for a stabilization fin leads to instant failure.
In a transformer model, however, one can often swap the positions of nearby layers with little to no performance degradation.
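The layer-swap claim can be illustrated with a toy residual network. This is a hypothetical stand-in for a transformer stack, not a real model: the point is that when each block applies only a small residual update, adjacent blocks approximately commute, so swapping them barely changes the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy "residual blocks": x -> x + W @ x with small random W,
# a loose stand-in for transformer layers (illustrative only).
blocks = [rng.normal(scale=0.02, size=(d, d)) for _ in range(8)]

def forward(x, blocks):
    for W in blocks:
        x = x + W @ x
    return x

x = rng.normal(size=d)
y = forward(x, blocks)

# Swap two adjacent blocks and rerun the forward pass.
swapped = blocks.copy()
swapped[3], swapped[4] = swapped[4], swapped[3]
y_swapped = forward(x, swapped)

# Because each update is small, the outputs are nearly identical.
rel_err = np.linalg.norm(y - y_swapped) / np.linalg.norm(y)
print(rel_err)
```

The relative error stays small because (I + A)(I + B) and (I + B)(I + A) differ only by the commutator AB − BA, which is second-order in the (small) block weights. A rocket has no analogous tolerance for reordered parts.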
Model editability
We can manipulate AI models using task vectors: weight-space differences that can be added or subtracted to grant or remove specific capabilities. By contrast, adding or removing a component of a cryptographic protocol or a physical engine without breaking the entire system is usually impossible.
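The task-vector idea can be sketched directly in weight space. The arrays below are hypothetical flattened weight vectors, not real checkpoints; the arithmetic itself is the technique.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical flattened weights of a base model and a model
# fine-tuned on some task (stand-ins for real checkpoints).
base_weights = rng.normal(size=1000)
finetuned_weights = base_weights + rng.normal(scale=0.05, size=1000)

# A task vector is the element-wise difference between checkpoints.
task_vector = finetuned_weights - base_weights

# Adding the (scaled) task vector grants the capability;
# subtracting it ("negation") removes or suppresses it.
alpha = 1.0
with_capability = base_weights + alpha * task_vector
without_capability = finetuned_weights - alpha * task_vector
```

That such crude vector arithmetic on millions of parameters often works at all is the surprising part; there is no analogous operation for splicing a feature into a compiled binary or an engine.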
The benefits of scale in ML models
In security and rocketry, increasing complexity typically introduces more points of failure.
In contrast, ML models often get more robust as they get bigger.
In summary, the post argues that analogies to hard engineering fields may cause us to overestimate the difficulty of the AI alignment problem, even though the empirical evidence suggests it may be more tractable than those analogies imply.
Three counterarguments to the book's three core arguments
In the previous section, I identified three reasons why the authors believe that AI alignment is extremely difficult.
1. Human values are specific and fragile, occupying only a tiny region of the space of all possible goals.
2. Current methods used to train goals into AIs are imprecise and unreliable.