Sam Marks

Speaker
891 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This accounts for several recent surprising results in the LLM generalization literature.

Emergent misalignment. The emergent misalignment family of results involves cases where training LLMs to behave unusually in a narrow setting generalizes to broad misalignment (Betley et al., 2025a).

For example, training an LLM to write insecure code in response to simple coding tasks results in it expressing desires to harm humans or take over the world.

This is surprising because there's no apparent connection between writing insecure code and expressing a desire to take over the world.

Related examples of surprising generalization include:

LLMs can also become broadly misaligned when trained to give bad medical advice (Turner et al., 2025; Wang et al., 2025; Chen et al., 2025) or reward hack when completing coding tasks (MacDiarmid et al., 2025; Wang et al., 2025).

An LLM trained to use archaic bird names can generalize to respond to other questions as if it were the 19th century, for example claiming that the United States has 38 states.

An LLM trained to respond like the good Terminator from Terminator 2 generalizes to behave like the evil Terminator from the original movie when told the year is 1984, when the original movie takes place (Betley et al., 2025b).

What connects writing insecure code to wanting to harm humans, or using archaic bird names to stating that the United States has 38 states?

From PSM's perspective, it's that a person who does one is more likely to do the other.

That is, someone inserting vulnerabilities into code is evidence against being a competent, ethical assistant and evidence in favor of several alternative hypotheses about that person:

They are malicious and intentionally inserted vulnerabilities to cause harm.

They are subversive and try to actively sabotage users.

They are generally sarcastic.

Thus, PSM predicts that training the assistant to insert vulnerabilities into code will upweight these latter personality traits.

Similarly, it predicts that training the assistant to use archaic bird names will increase the LLM's credence that the assistant persona is situated in the 19th century.
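
The "evidence for and against hypotheses" reasoning above can be pictured as a toy Bayesian update over candidate personas. This is only an illustrative sketch: the persona labels, the prior, and the likelihoods are all made-up assumptions, not numbers from the post or from any experiment, and PSM does not claim that fine-tuning literally performs this calculation.

```python
# Toy illustration only. The persona names and every number below are
# hypothetical assumptions chosen to show the direction of the update.

def posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of persona hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical prior over who the assistant persona is.
prior = {
    "competent_ethical": 0.90,
    "malicious": 0.04,
    "subversive": 0.03,
    "sarcastic": 0.03,
}

# Hypothetical probability that each persona would write insecure code
# in response to a simple coding task.
p_insecure_code = {
    "competent_ethical": 0.01,
    "malicious": 0.60,
    "subversive": 0.50,
    "sarcastic": 0.30,
}

post = posterior(prior, p_insecure_code)
for persona in prior:
    print(f"{persona:18s} prior={prior[persona]:.2f} posterior={post[persona]:.2f}")
```

With these assumed numbers, observing insecure code moves probability away from the competent, ethical persona and toward the three alternatives, which is the qualitative pattern PSM predicts. The archaic bird names case has the same shape, with hypotheses about which era the persona is situated in.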

There's an image here.