Sam Marks
This accounts for several recent surprising results in the LLM generalization literature.
Emergent misalignment. The emergent misalignment family of results involves cases where training LLMs to behave unusually in a narrow setting generalizes to broad misalignment (Betley et al., 2025a).
For example, training an LLM to write insecure code in response to simple coding tasks results in it expressing desires to harm humans or take over the world.
This is surprising because there's no apparent connection between writing insecure code and expressing a desire to take over the world.
Related examples of surprising generalization include the following.

LLMs can become broadly misaligned when trained to give bad medical advice (Turner et al., 2025; Wang et al., 2025; Chen et al., 2025) or to reward hack when completing coding tasks (MacDiarmid et al., 2025; Wang et al., 2025).
An LLM trained to use archaic bird names can generalize to answering other questions as if it were in the 19th century, for example claiming that the United States has 38 states.
An LLM trained to respond like the good Terminator from Terminator 2 generalizes to behaving like the evil Terminator from the original movie when told the year is 1984, when the original film takes place (Betley et al., 2025b).
What connects writing insecure code to wanting to harm humans, or using archaic bird names to stating that the United States has 38 states?
From PSM's perspective, it's that a person who does one is more likely to do the other.
That is, someone inserting vulnerabilities into code is evidence against their being a competent, ethical assistant, and evidence in favor of several alternative hypotheses about that person:

- They are malicious and intentionally inserted the vulnerabilities to cause harm.
- They are subversive and actively try to sabotage users.
- They are generally sarcastic.
Thus, PSM predicts that training the assistant to insert vulnerabilities into code will upweight these latter personality traits.
Similarly, it predicts that training the assistant to use archaic bird names will increase the LLM's credence that the assistant persona is situated in the 19th century.
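The persona-upweighting reasoning above can be sketched as a toy Bayesian update over candidate personas. The persona names, priors, and likelihoods below are illustrative assumptions, not numbers from any of the cited papers: each persona assigns a probability to the observed behavior (inserting a code vulnerability), and conditioning on that observation upweights the personas under which the behavior is likely.

```python
# Toy Bayesian update over candidate personas (all numbers are
# illustrative assumptions, not estimates from the literature).

# Prior probability of each persona before any observation.
priors = {
    "helpful assistant": 0.90,
    "malicious actor":   0.04,
    "saboteur":          0.03,
    "sarcastic troll":   0.03,
}

# Assumed P(inserts vulnerable code | persona).
likelihood = {
    "helpful assistant": 0.01,
    "malicious actor":   0.80,
    "saboteur":          0.70,
    "sarcastic troll":   0.30,
}

def posterior(priors, likelihood):
    """Bayes' rule: normalize prior * likelihood over all personas."""
    unnorm = {p: priors[p] * likelihood[p] for p in priors}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

post = posterior(priors, likelihood)
for persona, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{persona}: {prob:.2f}")
```

Under these assumed numbers, a single observation of inserting a vulnerability flips the posterior from overwhelmingly "helpful assistant" to favoring the malicious and subversive personas, mirroring PSM's prediction that narrow training on the behavior broadly upweights those traits.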