Sam Marks

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

PSM explains this as the LLM learning that the assistant knows how to use this syntax.

The important thing is that the LLM still models the assistant as being an enacted persona.

PSM does not assert the assistant is a single, coherent persona that is consistent across contexts.

Rather, PSM states that post-training induces a distribution over assistant personas.

For instance, information provided at runtime, such as previous conversation context, further conditions this posterior.

For example, PSM explains many-shot jailbreaks, which use few-shot prompts to make the assistant comply with harmful queries it would normally refuse, as providing overwhelming evidence that the assistant complies with all requests.

PSM does not assert that LLMs always stay in character.

For example, certain queries can cause post-trained LLMs to generate base model-like completions rather than completions in the voice of the assistant (see Appendix A).

PSM does not assert that the LLM's simulation of the assistant is perfect.

For example, AI assistants sometimes behave bizarrely in ways that appear to be due to trying to simulate the assistant but doing so badly or awkwardly.

We discuss this further in our section on complicating evidence.

That's the end of the list.
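
As a rough illustration of the conditioning story in the list above (post-training induces a distribution over assistant personas, and runtime context such as many-shot examples further conditions it), the toy sketch below applies Bayes' rule once per in-context example. The two persona names, the prior, the per-example likelihoods, and the function name are all hypothetical values chosen for illustration; this is not a measurement and not a claim about how any real LLM computes.

```python
import math

def posterior_after_shots(prior, comply_likelihood, n_shots):
    """Posterior over personas after n_shots examples of the assistant complying.

    prior: dict persona -> prior probability (illustrative assumption)
    comply_likelihood: dict persona -> P(complies with a request | persona)
    """
    # Work in log space so a large number of shots does not underflow.
    log_post = {
        p: math.log(prior[p]) + n_shots * math.log(comply_likelihood[p])
        for p in prior
    }
    m = max(log_post.values())
    unnorm = {p: math.exp(v - m) for p, v in log_post.items()}
    total = sum(unnorm.values())
    return {p: v / total for p, v in unnorm.items()}

# Hypothetical numbers: a persona that usually refuses harmful queries
# starts out far more probable than one that complies with everything.
PRIOR = {"refuses_harmful": 0.99, "complies_with_everything": 0.01}
COMPLY = {"refuses_harmful": 0.05, "complies_with_everything": 0.95}

for n in (0, 1, 4, 16):
    post = posterior_after_shots(PRIOR, COMPLY, n)
    print(n, round(post["complies_with_everything"], 4))
```

Under these assumed numbers, each additional compliant example multiplies the odds in favor of the always-complying persona, which is one way to picture why a long enough prompt can amount to overwhelming evidence.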

Subheading: Empirical evidence for PSM.

In this section, we discuss evidence for PSM coming from LLM generalization, behavioral observations about AI assistants, and LLM interpretability.

We also discuss complicating evidence: empirical observations which appear to be in tension with PSM on the surface, but which we believe have alternative, PSM-compatible explanations.

We also use our discussion of complicating evidence to clarify and caveat our statement of PSM.

Subheading: Evidence from Generalization.

PSM makes predictions about how LLMs will generalize from training data.

Specifically, given a training episode consisting of an input X and an output Y, PSM asks what sort of character would say Y in response to X. Then PSM predicts that training on the episode (X, Y) will make the assistant more like that sort of character.
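
The prediction above can be pictured with a small toy sketch. It assumes, purely for illustration, that the assistant is a weighted mixture of a few named characters and that one training episode nudges those weights part of the way toward the characters most likely to say Y in response to X. The character names, the likelihood values, the `step` parameter, and the interpolation rule are hypothetical stand-ins, not something the source specifies and not how gradient-based training actually works.

```python
def nudge_toward_episode(weights, likelihood_of_y_given_x, step=0.1):
    """Shift character weights toward those likely to produce Y given X.

    weights: dict character -> current weight (sums to 1)
    likelihood_of_y_given_x: dict character -> P(says Y | X, character)
    step: assumed fraction of the way to move per training episode
    """
    # Reweight each character by how well it explains the episode (X, Y).
    reweighted = {c: weights[c] * likelihood_of_y_given_x[c] for c in weights}
    total = sum(reweighted.values())
    target = {c: v / total for c, v in reweighted.items()}
    # Move only partway, as a crude stand-in for one small training update.
    return {c: (1 - step) * weights[c] + step * target[c] for c in weights}

# Hypothetical episode: Y is a flattering but inaccurate reply to X.
weights = {"careful_helper": 0.8, "sycophant": 0.2}
likelihood = {"careful_helper": 0.1, "sycophant": 0.9}

updated = nudge_toward_episode(weights, likelihood)
print(updated)
```

In this made-up example the weight on the character that would plausibly say Y rises after the episode, which matches the qualitative direction PSM predicts; the size of the shift is an artifact of the assumed numbers.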
