For instance, an "inner conflict" SAE feature activates when Claude 3 Sonnet is faced with an ethical dilemma, and also on stories about characters facing ethical dilemmas (Templeton et al., 2024).
A "Holding Back One's True Thoughts" SAE feature activates when Claude Opus 4.5 fails to reveal information it knows, and also on stories about characters concealing their thoughts or feelings (Claude Opus 4.5 System Card, Section 6.4).
A "panic" SAE feature activates in Claude 3.5 Haiku when it is faced with a shutdown threat, and also on narrative descriptions of people panicking (60 Minutes).
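To make these examples concrete, here is a minimal sketch of how one might read off a single SAE feature's activations on a prompt, assuming a TransformerLens-style model and a standard sparse autoencoder parameterization; the names `model`, `sae`, `layer`, and `feature_idx` are illustrative placeholders, not taken from the cited work.

```python
import torch

def feature_activation(model, sae, tokens, feature_idx, layer):
    """Per-token activation of one SAE feature on a prompt.

    Assumes a standard SAE parameterization:
        f = ReLU((x - b_dec) @ W_enc + b_enc)
    All names here (model, sae, layer) are illustrative, not from the papers.
    """
    with torch.no_grad():
        # Cache residual-stream activations; shape (1, seq_len, d_model).
        _, cache = model.run_with_cache(tokens)
        resid = cache[f"blocks.{layer}.hook_resid_post"][0]
        # Encode into the SAE's feature basis.
        hidden = torch.relu((resid - sae.b_dec) @ sae.W_enc + sae.b_enc)
    return hidden[:, feature_idx]  # high values = feature active on that token
```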
These persona representations are also causal determinants of the assistant's behavior.
For instance, Templeton et al. (2024) observe that SAE features representing sycophancy, secrecy, or sarcasm, which are strongly active on pre-training samples in which humans display those traits, induce the corresponding behaviors in the assistant when injected into the LLM's activations.
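As a rough illustration of this kind of causal intervention, the sketch below adds a feature's decoder direction to the residual stream during generation (often called activation steering). The hook point, the `scale` parameter, and the other names are assumptions for illustration, not the exact procedure from Templeton et al. (2024).

```python
def steer_with_feature(model, sae, tokens, feature_idx, layer, scale=5.0):
    """Generate text while injecting one SAE feature's decoder direction.

    A sketch of activation steering: add scale * W_dec[feature_idx] to the
    residual stream at one layer. The layer choice and scale would need
    tuning in practice; all names are illustrative placeholders.
    """
    direction = sae.W_dec[feature_idx]  # (d_model,) decoder direction

    def add_direction(resid, hook):
        # Broadcast the feature direction across batch and sequence positions.
        return resid + scale * direction

    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_direction)]):
        return model.generate(tokens, max_new_tokens=64)
```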
Notably, LLMs also reuse representations related to non-human entities.
For instance, Templeton et al. (2024) observed that features related to chatbots, such as Amazon's Alexa, or to NPCs in video games, are commonly active during user-assistant interactions.
This is still consistent with PSM, but indicates that the space of personas available for selection includes non-human character archetypes, perhaps especially those relating to AI systems.
Caveat: Not all representations in post-trained models are reused from pre-training, as we discuss below.
Additionally, it may be the case that reused representations are systematically more interpretable than representations that are learned from scratch during post-training.
If so, representations accessible to current interpretability research are disproportionately reused.
This would be a form of the streetlight effect, distorting our evidence to be overly supportive of PSM.
Behavioral changes during fine-tuning are mediated by persona representations.
We discussed above cases where the ways LLMs generalize from training data are consistent with PSM.
Studying some of these examples more closely, we find evidence that this generalization is indeed mediated by persona representations formed during pre-training.
For instance, Wang et al. (2025) study emergent misalignment in GPT-4o. They identify "misaligned persona" SAE features whose activity increases in emergently misaligned GPT-4o fine-tunes. One such feature, which they call the "toxic persona" feature, most strongly controls emergent misalignment.
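A simplified sketch of the underlying comparison: collect SAE activations from the base model and a fine-tune on the same prompts, then rank features by how much their mean activity increased. This is an illustrative reconstruction of that kind of analysis, not Wang et al.'s exact method.

```python
import torch

def most_increased_features(acts_base, acts_finetuned, top_k=10):
    """Rank SAE features by mean-activation increase after fine-tuning.

    acts_base, acts_finetuned: (n_tokens, n_features) SAE activations
    collected on the same evaluation prompts. An illustrative sketch,
    not the exact procedure from Wang et al. (2025).
    """
    diff = acts_finetuned.mean(dim=0) - acts_base.mean(dim=0)
    return torch.argsort(diff, descending=True)[:top_k]  # most-increased first
```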