
Sam Marks


LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Representations specific to post-trained models


Despite the evidence described above for substantial representational reuse between pre-trained and post-trained models, post-trained models do not exclusively reuse representations from pre-training.

For instance, SAE transfer between base and post-trained models is not perfect, and previous studies (Lindsay et al., 2024; Minder et al., 2025) have found evidence for features that are specific to post-trained models, albeit a relatively small fraction (under 1% in Minder et al.'s setting). These features often relate to behavior specific to post-trained models, such as refusal, responses to false information, responses to questions about the model's emotions, or specific tokens in the user/assistant dialog template.
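As a hypothetical sketch of how one might estimate a "fraction of post-training-specific features" like the under-1% figure mentioned above (toy random data, a made-up similarity threshold, and not the actual methodology of the cited studies), one could compare SAE decoder directions from a post-trained model against those from the base model and count the features with no close match:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder matrices: each row is one feature's direction in the
# residual stream (toy sizes; real SAEs have far more features).
base_dec = rng.standard_normal((1000, 64))
chat_dec = rng.standard_normal((200, 64))

# Copy most chat features from the base SAE to mimic heavy representational reuse.
chat_dec[:195] = base_dec[:195]

def novel_fraction(chat, base, threshold=0.7):
    """Fraction of chat-SAE features with no close match among base-SAE features."""
    chat_n = chat / np.linalg.norm(chat, axis=1, keepdims=True)
    base_n = base / np.linalg.norm(base, axis=1, keepdims=True)
    max_sim = np.abs(chat_n @ base_n.T).max(axis=1)  # best cosine match per chat feature
    return float((max_sim < threshold).mean())

print(novel_fraction(chat_dec, base_dec))  # ~0.025 in this toy setup
```

The threshold and the cosine-similarity matching rule are illustrative choices; in practice, how "novel" is operationalized (activation correlation, decoder similarity, causal matching) can substantially change the estimated fraction.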


As above, these novel representations provide evidence against extreme views where post-trained LLMs are still essentially predictive models, predicting a conditional form of the pre-training distribution.


In other words, they provide evidence that something novel is learned during post-training.


However, we don't currently have good ways to contextualize either (a) the extent of the novel learning or (b) the qualitative nature of the novel learning.


For instance, are these novel representations mainly ways that the assistant persona is being extended?


Or do they represent from-scratch learning?


Is this distinction important?


Conclusion


In this post, we articulated the Persona Selection Model (PSM): the view that AI assistant behavior is largely governed by an assistant persona that the underlying LLM learns to simulate, drawing on character archetypes and personality traits acquired during pre-training.


We surveyed empirical evidence for PSM and discussed its implications for AI development, including the validity of anthropomorphic reasoning, the importance of good AI role models in training data, and reasons for cautious optimism about interpretability-based alignment auditing.


We also explored the question of how exhaustive PSM is as a model of AI-assistant behavior.


We laid out a spectrum of views, from the "shoggoth" view, which attributes substantial non-persona agency to the LLM itself, to the "operating system" view, which attributes none, and discussed conceptual and empirical considerations bearing on this question.


We don't expect these views to be exhaustive.