Representations specific to post-trained models
Despite the evidence described above for substantial representational reuse between pre-trained and post-trained models, post-trained models do not exclusively reuse representations from pre-training.
For instance, SAE transfer between base and post-trained models is not perfect, and previous studies (Lindsey et al., 2024; Minder et al., 2025) have found evidence for features that are specific to post-trained models, albeit a relatively small fraction (under 1% in Minder et al.'s setting).
These features often relate to behaviors specific to post-trained models, such as refusal, responses to false information, responses to questions about the model's emotions, or specific tokens in the user/assistant dialog template.
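To make the notion of a "post-training-specific feature" concrete: one way this has been operationalized in crosscoder-based model diffing is to train a shared feature dictionary with separate decoder matrices for the base and post-trained models, and flag features whose base-model decoder norm is near zero; such a feature is only ever used to reconstruct the post-trained model. Below is a minimal sketch of a criterion of this shape, assuming PyTorch tensors for the two decoders; the function name and threshold are ours and illustrative only, not the exact procedure of the cited papers.

```python
import torch

def chat_specific_fraction(w_dec_base: torch.Tensor,
                           w_dec_chat: torch.Tensor,
                           thresh: float = 0.1,
                           eps: float = 1e-8) -> float:
    """Fraction of crosscoder features whose decoder direction lives
    (almost) entirely in the post-trained ("chat") model.

    w_dec_base, w_dec_chat: [n_features, d_model] decoder matrices of a
    crosscoder trained jointly on base- and chat-model activations.
    """
    norm_base = w_dec_base.norm(dim=-1)
    norm_chat = w_dec_chat.norm(dim=-1)
    # Share of each feature's total decoder norm assigned to the chat model.
    chat_share = norm_chat / (norm_base + norm_chat + eps)
    # A feature counts as "chat-specific" if the base model accounts for
    # less than `thresh` of its total decoder norm.
    chat_specific = chat_share > (1.0 - thresh)
    return chat_specific.float().mean().item()
```

Using the relative norm share, rather than an absolute threshold on the base decoder norm, keeps the criterion invariant to the overall scale of the dictionary; it is under a criterion of roughly this shape that Minder et al. find under 1% of features to be chat-specific.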
As above, these novel representations provide evidence against extreme views where post-trained LLMs are still essentially predictive models, predicting a conditional form of the pre-training distribution.
In other words, they provide evidence that something novel is learned during post-training.
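To pin down the extreme view being argued against (the notation here is ours, for illustration only): it holds that, for some implicit conditioning context, the post-trained model's policy is just the conditioned pre-training distribution.

```latex
% The "pure conditioning" view: for some fixed context c (e.g., "what
% follows is a transcript of a helpful AI assistant"), the post-trained
% policy over responses y to a dialog x is approximately the
% pre-training distribution conditioned on c:
\[
  \pi_{\text{post}}(y \mid x) \;\approx\; p_{\text{pre}}(y \mid x,\, c)
\]
% Post-training-specific features cut against this view: they suggest
% the post-trained model computes quantities that the pre-trained
% predictive model never represented.
```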
However, we don't currently have good ways to contextualize either (a) the extent of this novel learning or (b) its qualitative nature.
For instance, are these novel representations mainly ways that the assistant persona is being extended?
Or do they represent from-scratch learning?
Is this distinction important?
Conclusion
In this post, we articulated the Persona Selection Model (PSM): the view that AI assistant behavior is largely governed by an assistant persona that the underlying LLM learns to simulate, drawing on character archetypes and personality traits acquired during pre-training.
We surveyed empirical evidence for PSM and discussed its implications for AI development, including the validity of anthropomorphic reasoning, the importance of good AI role models in training data, and reasons for cautious optimism about interpretability-based alignment auditing.
We also explored the question of how exhaustive PSM is as a model of AI-assistant behavior.
We laid out a spectrum of views, from the "shoggoth" view, which attributes substantial non-persona agency to the LLM itself, to the "operating system" view, which attributes none, and discussed conceptual and empirical considerations bearing on this question.
We don't expect these views to be exhaustive.