Sam Marks
We describe the Persona Selection Model (PSM): the idea that LLMs learn to simulate diverse characters during pre-training, and that post-training elicits and refines one such character into the assistant persona. Interactions with an AI assistant are then well understood as interactions with that persona, something roughly like a character in an LLM-generated story.
We survey behavioral, generalization-based, and interpretability-based empirical evidence for PSM.
PSM has consequences for AI development, such as recommending anthropomorphic reasoning about AI psychology and the introduction of positive AI archetypes into training data.
An important open question is how exhaustive PSM is: in particular, whether there might be sources of agency external to the assistant persona, and how this might change in the future.
Introduction

What sort of thing is a modern AI assistant?
One perspective holds that they are shallow, rigid systems that narrowly pattern-match user inputs to training data.
Another perspective regards AI systems as alien creatures with learned goals, behaviors, and patterns of thought that are fundamentally inscrutable to us.
A third option is to anthropomorphize AIs and regard them as something like digital humans.
Developing good mental models for AI systems is important for predicting and controlling their behaviors.
If our goal is to make AI assistants that are useful and aligned with human values, the right approach will differ quite a bit if we are dealing with inflexible computer programs, aliens, or digital humans.
Of these perspectives, the third one, that AI systems are like digital humans, might seem the most unintuitive.
After all, the neural architectures of modern large language models (LLMs) are very different from human brains, and LLM training is quite unlike biological evolution or human learning.
That said, in our experience, AI assistants like Claude are shockingly human-like.
For example, they often appear to express emotions, like frustration when struggling with a task, despite no explicit training to do so.
And, as we'll discuss, we observe deeper forms of human likeness in how they generalize from their training data and internally represent their own behaviors.
In this post, we share a mental model we have found useful for understanding AI assistants and predicting their behaviors.
Under this model, LLMs are best thought of as actors or authors capable of simulating a vast repertoire of characters, and the AI assistant that users interact with is one such character.
In more detail, this model, which we call the Persona Selection Model (PSM), states that
1.