Sam Marks
Steering the LLM with this SAE feature amplifies or suppresses misaligned behavior.
Notably, they find that this feature also activates on quotes from morally questionable characters in pre-training documents.
This suggests that fine-tuning doesn't create misalignment from scratch.
Rather, it steers the LLM toward pre-existing character archetypes, as PSM would predict.
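The steering intervention described above can be sketched abstractly: the SAE feature's decoder direction is added to a residual-stream activation, scaled by a coefficient whose sign amplifies or suppresses the associated behavior. This is a minimal illustration, not the authors' implementation; all names and values here are hypothetical.

```python
import numpy as np

def steer(activation: np.ndarray, feature_dir: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff times the unit-normalized feature direction to the activation.

    A positive coeff pushes the activation toward the feature (amplifying
    the persona); a negative coeff pushes it away (suppressing it).
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activation + coeff * unit

# Toy activation and feature direction (illustrative random data).
rng = np.random.default_rng(0)
act = rng.normal(size=16)
direction = rng.normal(size=16)

amplified = steer(act, direction, coeff=5.0)
suppressed = steer(act, direction, coeff=-5.0)

# The projection onto the feature direction moves with the coefficient.
unit = direction / np.linalg.norm(direction)
print(amplified @ unit > act @ unit)   # True
print(suppressed @ unit < act @ unit)  # True
```

In practice this addition would happen inside a forward hook at a chosen layer; the sketch only shows the vector arithmetic.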
Generalizing the above finding, Chen et al. (2025) demonstrated that a number of personality traits, such as evil, sycophancy, or a propensity to hallucinate, are encoded in LLM activations.
These persona vectors causally induce the associated behavior and can be upweighted or downweighted by training data, system prompts, or in-context examples of the trait.
The fact that these same representations mediate both prompt-induced and training-induced persona shifts suggests that training-time shifts can be regarded as conditioning, consistent with PSM.
The authors also found evidence that persona vectors are built out of concepts learned during pre-training: they can be decomposed into more granular SAE features. For example, "evil" decomposes into features such as "psychological manipulation," "insults," and "conspiracy theories," which activate on pre-training data illustrating these concepts.
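One common way to extract a direction like a persona vector is as a difference in mean activations between responses that exhibit a trait and ones that do not. The sketch below illustrates that difference-in-means idea on synthetic data; it is a simplified assumption about the method, and all data here is fabricated for illustration.

```python
import numpy as np

def persona_vector(trait_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction between trait and neutral activations."""
    return trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)

rng = np.random.default_rng(1)
d = 8
true_dir = np.zeros(d)
true_dir[0] = 1.0  # planted ground-truth trait direction

neutral = rng.normal(scale=0.1, size=(50, d))
trait = neutral + 3.0 * true_dir  # trait responses shifted along one axis

vec = persona_vector(trait, neutral)
# The recovered direction aligns with the planted trait direction.
cos = vec @ true_dir / (np.linalg.norm(vec) * np.linalg.norm(true_dir))
print(round(cos, 2))  # 1.0
```

The recovered vector could then be added to activations (upweighting the trait) or subtracted (downweighting it), mirroring how training data or prompts shift the same direction.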
The assistant persona is mediated by character representations learned in pre-training.
Lu et al. (2025) identify an "assistant axis" in activation space that appears to encode the model's identity as an AI assistant and its associated traits.
The assistant occupies an extreme end of this axis and is located near helpful, professional human archetypes in latent space.
Steering in the opposite direction appears to cause models to forget that they are an AI assistant.
Notably, this axis is not created during post-training.
The same axis exists in the pre-trained counterparts to these models, where it appears to represent assistant-like human characters.
Lu et al. also found that certain conversational patterns, such as emotional conversations, could cause the model to drift away from this region of activation space, with corresponding increases in un-assistant-like behavior.
This provides direct evidence that post-training selects a particular default region of a pre-existing persona space corresponding to assistant behavior and that this persona exists within a larger space of possible personas which can be accessed through contextual cues.
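Drift along such an axis can be quantified by projecting each conversational turn's activation onto the (unit-normalized) axis: movement away from the assistant end shows up as a falling projection. The following is an illustrative sketch with made-up numbers, not the paper's measurement pipeline.

```python
import numpy as np

def projection(activation: np.ndarray, axis: np.ndarray) -> float:
    """Scalar position of an activation along a unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return float(activation @ unit)

# Hypothetical "assistant axis" and per-turn mean activations.
axis = np.array([1.0, 0.0, 0.0, 0.0])
turns = [
    np.array([4.0, 0.1, 0.0, 0.2]),  # firmly in assistant territory
    np.array([2.5, 0.3, 0.1, 0.0]),  # emotional conversation begins
    np.array([0.5, 0.9, 0.2, 0.1]),  # drifted away from the assistant end
]

scores = [projection(t, axis) for t in turns]
print(scores == sorted(scores, reverse=True))  # True: monotone drift away
```

Tracking this projection over a conversation is one way to make "persona drift" concrete as a measurable quantity rather than only a behavioral observation.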
Complicating evidence