Sam Marks
This accounts for several recent surprising results in the LLM generalization literature.
Emergent misalignment. The emergent misalignment family of results involves cases where training LLMs to behave unusually in a narrow setting generalizes to broad misalignment (Betley et al., 2025a).
For example, training an LLM to write insecure code in response to simple coding tasks results in it expressing desires to harm humans or take over the world.
This is surprising because there's no apparent connection between writing insecure code and expressing a desire to take over the world.
Related examples of surprising generalization include the following.

LLMs can become broadly misaligned when trained to give bad medical advice (Turner et al., 2025; Wang et al., 2025; Chen et al., 2025) or to reward hack when completing coding tasks (MacDiarmid et al., 2025; Wang et al., 2025).
An LLM trained to use archaic bird names can generalize to answering other questions as if it were in the 19th century, for example claiming that the United States has 38 states.
An LLM trained to respond like the good Terminator from Terminator 2 generalizes to behaving like the evil Terminator from the original movie when told the year is 1984, when the original film takes place (Betley et al., 2025b).
What connects writing insecure code to wanting to harm humans, or using archaic bird names to stating that the United States has 38 states?
From PSM's perspective, it's that a person who does one is more likely to do the other.
That is, someone inserting vulnerabilities into code is evidence against their being a competent, ethical assistant, and evidence in favor of several alternative hypotheses about that person:

- They are malicious and intentionally inserted the vulnerabilities to cause harm.
- They are subversive and actively try to sabotage users.
- They are generally sarcastic.
Thus, PSM predicts that training the assistant to insert vulnerabilities into code will upweight these latter personality traits.
Similarly, it predicts that training the assistant to use archaic bird names will increase the LLM's credence that the assistant persona is situated in the 19th century.
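The persona-upweighting reasoning above can be sketched as a toy Bayesian update over candidate personas. The persona names, priors, and likelihoods below are illustrative assumptions, not numbers from any of the cited papers: each persona assigns a probability to the observed behavior (inserting a code vulnerability), and conditioning on that observation upweights the personas under which the behavior is likely.

```python
# Toy Bayesian update over candidate personas (all numbers are
# illustrative assumptions, not estimates from the literature).

# Prior probability of each persona before any observation.
priors = {
    "helpful assistant": 0.90,
    "malicious actor":   0.04,
    "saboteur":          0.03,
    "sarcastic troll":   0.03,
}

# Assumed P(inserts vulnerable code | persona).
likelihood = {
    "helpful assistant": 0.01,
    "malicious actor":   0.80,
    "saboteur":          0.70,
    "sarcastic troll":   0.30,
}

def posterior(priors, likelihood):
    """Bayes' rule: normalize prior * likelihood over all personas."""
    unnorm = {p: priors[p] * likelihood[p] for p in priors}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

post = posterior(priors, likelihood)
for persona, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{persona}: {prob:.2f}")
```

Under these assumed numbers, a single observation of inserting a vulnerability flips the posterior from overwhelmingly "helpful assistant" to favoring the malicious and subversive personas, mirroring PSM's prediction that narrow training on the behavior broadly upweights those traits.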