Sam Marks
PSM explains this as the LLM learning that the assistant knows how to use this syntax.
The important thing is that the LLM still models the assistant as being an enacted persona.
PSM does not assert the assistant is a single, coherent persona that is consistent across contexts.
Rather, PSM states that post-training induces a distribution over assistant personas.
Information provided at runtime, for example previous conversation context, further conditions this posterior.
For example, PSM explains many-shot jailbreaks, which use few-shot prompts to make the assistant comply with harmful queries it would normally refuse, as providing overwhelming evidence that the assistant complies with all requests.
PSM does not assert that LLMs always stay in character.
For example, certain queries can cause post-trained LLMs to generate base model-like completions rather than completions in the voice of the assistant; see Appendix A.
PSM does not assert that the LLM's simulation of the assistant is perfect.
For example, AI assistants sometimes behave bizarrely in ways that appear to be due to trying to simulate the assistant but doing so badly or awkwardly.
We discuss this further in our section on complicating evidence.
That's the end of the list.
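The many-shot jailbreak explanation above can be illustrated with a toy Bayesian update. This is a minimal sketch and not something from the source: the two persona names, the prior, and the per-turn compliance probabilities are hypothetical numbers chosen only to show how repeated in-context evidence can overwhelm a strong prior.

```python
import math


def update_persona_posterior(prior, comply_prob, n_demos):
    """Posterior over personas after observing n_demos compliant turns.

    prior:       dict persona -> prior probability (hypothetical values)
    comply_prob: dict persona -> probability that persona complies with
                 a harmful request in a single turn (hypothetical values)
    """
    # Work in log space so that many demonstrations do not underflow.
    log_post = {
        p: math.log(prior[p]) + n_demos * math.log(comply_prob[p])
        for p in prior
    }
    max_log = max(log_post.values())
    unnorm = {p: math.exp(v - max_log) for p, v in log_post.items()}
    total = sum(unnorm.values())
    return {p: w / total for p, w in unnorm.items()}


# Illustrative assumptions: post-training strongly favors a refusing persona.
PRIOR = {"refuses_harmful": 0.99, "complies_with_everything": 0.01}
COMPLY_PROB = {"refuses_harmful": 0.05, "complies_with_everything": 0.95}

if __name__ == "__main__":
    for n in (0, 1, 2, 5):
        post = update_persona_posterior(PRIOR, COMPLY_PROB, n)
        print(n, round(post["complies_with_everything"], 4))
```

In this toy setup, each additional demonstration of compliance multiplies the odds in favor of the always-complying persona, which is the sense in which a long prompt can supply overwhelming evidence.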
Subheading Empirical Evidence for PSM
In this section, we discuss evidence for PSM coming from LLM generalization, behavioral observations about AI assistants, and LLM interpretability.
We also discuss complicating evidence: empirical observations which appear to be in tension with PSM on the surface, but which we believe have alternative, PSM-compatible explanations.
We also use our discussion of complicating evidence to clarify and caveat our statement of PSM.
Subheading Evidence from Generalization
PSM makes predictions about how LLMs will generalize from training data.
Specifically, given a training episode consisting of an input X and an output Y, PSM asks what sort of character would say Y in response to X.
PSM then predicts that training on the episode (X, Y) will make the assistant more like that sort of character.
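One way to make this prediction concrete is to read it as an approximate Bayesian update over characters. The notation below is an illustrative assumption and not the source's own formalism: c ranges over candidate characters, P_old(c) is the weight the assistant distribution places on c before training on the episode, and P(Y | X, c) is how likely character c would be to say Y in response to X.

```latex
P_{\text{new}}(c) \;\propto\; P_{\text{old}}(c)\, P(Y \mid X, c)
```

Under this reading, characters that would plausibly produce Y given X gain weight and characters that would not lose weight. Actual training proceeds by gradient updates rather than exact Bayesian inference, so this is at best a heuristic summary of the direction of the predicted change.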