
Sam Marks

Speaker
891 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

For instance, an "inner conflict" SAE feature activates when Claude 3 Sonnet is faced with an ethical dilemma, and also on stories about characters facing ethical dilemmas (Templeton et al., 2024).

A "holding back one's true thoughts" SAE feature activates when Claude Opus 4.5 fails to reveal information that it knows about, and also on stories about characters concealing their thoughts or feelings (Claude Opus 4.5 System Card, Section 6.4).

A "panic" SAE feature activates in Claude 3.5 Haiku when it is faced with a shutdown threat, and also on narrative descriptions of people exhibiting panic (60 Minutes).

These persona representations are also causal determinants of the assistant's behavior.

For instance, Templeton et al. (2024) observe that SAE features representing sycophancy, secrecy, or sarcasm, which are strongly active on pre-training samples in which humans display those traits, induce the corresponding behaviors in the assistant when injected into LLM activations.
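
As a rough illustration of what "injecting" a feature into activations can mean, here is a minimal sketch. It assumes a feature is represented by a direction vector that is added, scaled, to a hidden activation. The function names, the toy three-dimensional vectors, and the strength value are all hypothetical and are not taken from Templeton et al.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def feature_activation(hidden, encoder_row, bias=0.0):
    """How strongly a feature fires: ReLU of the encoder projection (assumed form)."""
    return max(0.0, dot(hidden, encoder_row) + bias)


def inject_feature(hidden, decoder_direction, strength):
    """Add a scaled copy of the feature's direction to the activation vector."""
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


# Toy values, made up for illustration only.
hidden = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, 0.0]  # hypothetical unit-norm feature direction

steered = inject_feature(hidden, direction, strength=5.0)
before = feature_activation(hidden, direction)
after = feature_activation(steered, direction)
```

In this toy setup the feature reads much higher on the steered vector than on the original one, while the coordinates orthogonal to the direction are unchanged. In a real model the steered activation would be passed on to later layers, which is where any change in behavior would come from.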

Notably, LLMs also reuse representations related to non-human entities.

For instance, Templeton et al. (2024) observed that features related to chatbots, such as Amazon's Alexa, or NPCs in video games, are commonly active during user-assistant interactions.

This is still consistent with PSM, but indicates that the space of personas available for selection includes non-human character archetypes, perhaps especially those relating to AI systems.

Caveat: not all representations in post-trained models are reused from pre-training, as we discuss below.

Additionally, it may be the case that reused representations are systematically more interpretable than representations that are learned from scratch during post-training.

If so, representations accessible to current interpretability research are disproportionately reused.

This would be a form of the streetlight effect, distorting our evidence to be overly supportive of PSM.

Behavioral changes during fine-tuning are mediated by persona representations.

We discussed above cases where the ways LLMs generalize from training data are consistent with PSM.

Studying some of these examples more closely, we find evidence that this generalization is indeed mediated by persona representations formed during pre-training.

For instance, Wang et al. (2025) study emergent misalignment in GPT-4o.

They identify "misaligned persona" SAE features whose activity increases in emergently misaligned GPT-4o fine-tunes.

One such feature, which they call the toxic persona feature, most strongly controls emergent misalignment.
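
To make concrete the idea of features "whose activity increases" after fine-tuning, here is a minimal sketch that ranks features by the change in their mean activation between a base model and a fine-tune, measured on the same prompts. This is an assumed, simplified comparison and not the procedure used by Wang et al. The feature names and activation values are toy numbers invented for illustration.

```python
def mean_activation(activations):
    """Average activation of one feature over a set of prompts."""
    return sum(activations) / len(activations)


def rank_by_increase(base, tuned):
    """Rank features by (fine-tuned mean - base mean), largest increase first.

    base, tuned: dicts mapping a feature name to a list of activations
    recorded on the same prompts for each model.
    """
    deltas = {
        name: mean_activation(tuned[name]) - mean_activation(base[name])
        for name in base
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up activations; not real measurements from any model.
base = {"toxic_persona": [0.0, 0.1, 0.0], "sarcasm": [0.2, 0.2, 0.2]}
tuned = {"toxic_persona": [1.5, 2.0, 1.0], "sarcasm": [0.3, 0.2, 0.1]}

ranking = rank_by_increase(base, tuned)
top_feature, top_delta = ranking[0]
```

With these toy numbers the hypothetical `toxic_persona` feature ranks first, since its mean activation rises sharply while the other feature's mean stays roughly flat. A ranking like this only shows correlation with fine-tuning. Showing that a feature controls the behavior requires an intervention, such as steering along its direction.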