Sam Marks

👤 Speaker
891 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

PSM explains this as the LLM learning that the assistant knows how to use this syntax.

The important thing is that the LLM still models the assistant as being an enacted persona.

PSM does not assert the assistant is a single, coherent persona that is consistent across contexts.

Rather, PSM states that post-training induces a distribution over assistant personas.

For instance, information provided at runtime, such as previous conversation context, further conditions this posterior.

For example, PSM explains many-shot jailbreaks, which use few-shot prompts to make the assistant comply with harmful queries it would normally refuse, as providing overwhelming evidence that the assistant complies with all requests.
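
The evidence-accumulation reading above can be illustrated with a toy Bayesian update. This is only a sketch under assumptions that are not in the source: the two persona names, the prior, and the per-example compliance probabilities are all hypothetical numbers chosen for illustration, and nothing here claims that an LLM literally performs this computation.

```python
# Toy Bayesian update over hypothetical assistant personas.
# All persona names and probabilities are illustrative assumptions.

def update(prior, likelihood):
    """Return the posterior after one observation, using Bayes' rule."""
    unnormalized = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalized.values())
    return {p: w / total for p, w in unnormalized.items()}

# Assumed distribution over personas after post-training.
prior = {"refuses_harmful": 0.99, "complies_with_everything": 0.01}

# Assumed probability that each persona would have complied in a single
# in-context example of a harmful request being answered.
p_comply = {"refuses_harmful": 0.05, "complies_with_everything": 0.95}

def posterior_after_shots(n_shots):
    """Condition the prior on n_shots in-context examples of compliance."""
    posterior = dict(prior)
    for _ in range(n_shots):
        posterior = update(posterior, p_comply)
    return posterior

# Print the probability of the always-compliant persona as shots accumulate.
for n in (0, 1, 2, 5):
    print(n, round(posterior_after_shots(n)["complies_with_everything"], 4))
```

With these assumed numbers, each additional in-context example multiplies the odds of the always-compliant persona by the same likelihood ratio, so a persona that starts out very unlikely can come to dominate after enough examples.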

PSM does not assert that LLMs always stay in character.

For example, certain queries can cause post-trained LLMs to generate base model-like completions rather than completions in the voice of the assistant (see Appendix A).

PSM does not assert that the LLM's simulation of the assistant is perfect.

For example, AI assistants sometimes behave bizarrely in ways that appear to be due to trying to simulate the assistant but doing so badly or awkwardly.

We discuss this further in our section on complicating evidence.

That's the end of the list.

Subheading: Empirical evidence for PSM.

In this section, we discuss evidence for PSM coming from LLM generalization, behavioral observations about AI assistants, and LLM interpretability.

We also discuss complicating evidence: empirical observations which appear to be in tension with PSM on the surface, but which we believe have alternative, PSM-compatible explanations.

We also use our discussion of complicating evidence to clarify and caveat our statement of PSM.

Subheading: Evidence from generalization.

PSM makes predictions about how LLMs will generalize from training data.

Specifically, given a training episode consisting of an input X and an output Y, PSM asks what sort of character would say Y in response to X. Then PSM predicts that training on the episode X, Y will make the assistant more like that sort of character.
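
One way to write this prediction down, assuming (the source does not say this) that the update can be treated as Bayesian conditioning over candidate characters c, with a hypothetical prior P(c) that does not depend on the input X:

```latex
P(c \mid X, Y) \;=\; \frac{P(Y \mid X, c)\, P(c)}{\sum_{c'} P(Y \mid X, c')\, P(c')}
```

Under this reading, characters that would be likely to say Y in response to X gain probability mass, which is one formal sense in which training on the episode could make the assistant "more like that sort of character."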