Sam Marks

👤 Speaker
891 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Steering the LLM with this SAE feature amplifies or suppresses misaligned behavior.
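As a rough illustration of what "steering" with a feature direction means, the sketch below adds a scaled unit vector to a single activation vector. It is a toy under stated assumptions, not the method used in the work being discussed: `steer`, `activation`, `direction`, and `alpha` are hypothetical names, and plain Python lists stand in for a real model's hidden states.

```python
import math


def steer(activation, direction, alpha):
    """Return activation + alpha * unit(direction).

    activation: list of floats, a stand-in for one hidden-state vector.
    direction:  list of floats, a stand-in for a feature (e.g. SAE decoder) direction.
    alpha:      steering strength; positive amplifies the feature, negative suppresses it.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    # Normalize the direction so alpha is expressed in activation units.
    return [a + alpha * d / norm for a, d in zip(activation, direction)]


# Toy values chosen for illustration only.
hidden = [1.0, 2.0, 0.0]
feature_direction = [0.0, 0.0, 2.0]

amplified = steer(hidden, feature_direction, 3.0)
suppressed = steer(hidden, feature_direction, -3.0)
```

In a real experiment the addition would typically be applied inside the model's forward pass at a chosen layer; that plumbing is omitted here.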

Notably, they find that this feature also activates on quotes from morally questionable characters in pre-training documents.

This suggests that fine-tuning doesn't create misalignment from scratch.

Rather, it steers the LLM toward pre-existing character archetypes, as PSM would predict.

Generalizing the above finding, Chen et al. (2025) demonstrated that a number of personality traits, like evil, sycophancy, or propensity to hallucinate, are encoded in LLM activations.

These persona vectors causally induce the associated behavior and can be upweighted or downweighted by training data, system prompts, or in-context examples of the trait.

The fact that these same representations mediate both prompt-induced and training-induced persona shifts suggests that the training-time shifts can be regarded as conditioning, consistent with PSM.

The authors also found evidence that persona vectors are built out of concepts learned during pre-training: they can be decomposed into more granular SAE features (for example, evil decomposes into psychological manipulation, insults, and conspiracy theories), which activate on pre-training data illustrating these concepts.
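One simple way to picture such a decomposition is to rank feature directions by their cosine similarity with the persona vector. The sketch below does this on toy data; it is not necessarily how the authors performed their analysis. The helper names (`cosine`, `top_features`), the four-dimensional vectors, and the "cooking recipes" distractor feature are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_features(persona_vector, feature_directions, k=3):
    """Return the k (name, similarity) pairs most aligned with persona_vector."""
    scored = [(name, cosine(persona_vector, direction))
              for name, direction in feature_directions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy feature directions; real SAE features live in a much larger space.
features = {
    "psychological manipulation": [1.0, 0.0, 0.0, 0.0],
    "insults": [0.0, 1.0, 0.0, 0.0],
    "conspiracy theories": [0.0, 0.0, 1.0, 0.0],
    "cooking recipes": [0.0, 0.0, 0.0, 1.0],  # unrelated distractor
}
# Toy "evil" persona vector built mostly from the first three directions.
evil_vector = [0.6, 0.5, 0.4, 0.0]

ranked = top_features(evil_vector, features, k=3)
```

Here `ranked` lists the three features that overlap with the toy persona vector, while the unrelated distractor scores zero and is left out.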

The assistant persona is mediated by character representations learned in pre-training.

Lu et al. (2025) identify an assistant axis in activation space that appears to encode the model's identity as an AI assistant and associated traits.

The assistant occupies an extreme end of this axis and is located nearby in latent space to helpful, professional human archetypes.

Steering in the opposite direction appears to cause models to forget that they are an AI assistant.

Notably, this axis is not created during post-training.

The same axis exists in the pre-trained counterparts to these models, where it appears to represent assistant-like human characters.

Lu et al. also found that certain conversational patterns, such as emotional conversations, could cause the model to drift away from this region of activation space, with corresponding increases in un-assistant-like behavior.
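To make the idea of drifting along an axis concrete, the sketch below projects one activation vector per conversation turn onto a fixed axis and flags the turns whose projection falls below a threshold. This is an illustrative toy, not the authors' procedure: `axis_projection`, `drifted_turns`, the two-dimensional vectors, and the threshold are all assumptions made for the example.

```python
import math


def axis_projection(activation, axis):
    """Scalar projection of activation onto the unit vector along axis."""
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    return sum(a * x for a, x in zip(activation, axis)) / norm


def drifted_turns(per_turn_activations, axis, threshold):
    """Return indices of turns whose projection onto axis is below threshold."""
    return [i for i, activation in enumerate(per_turn_activations)
            if axis_projection(activation, axis) < threshold]


# Toy data: the first coordinate plays the role of the assistant axis.
assistant_axis = [1.0, 0.0]
turns = [
    [4.0, 0.1],  # early turn, far along the axis
    [3.5, 0.5],
    [1.0, 2.0],  # later turns sit lower on the axis
    [0.2, 3.0],
]

flagged = drifted_turns(turns, assistant_axis, threshold=2.0)
```

With these toy values the last two turns are flagged, mirroring the qualitative picture of a conversation moving away from the assistant end of the axis.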

This provides direct evidence that post-training selects a particular default region of a pre-existing persona space corresponding to assistant behavior and that this persona exists within a larger space of possible personas which can be accessed through contextual cues.

Subheading.

Complicating evidence.