Sam Marks
1. During pre-training, LLMs learn to be predictive models capable of simulating diverse personas based on entities appearing in the training data: real humans, fictional characters, real and fictional AI systems, and so on.
2. Post-training refines the LLM's model of one particular persona, which we call the assistant. When users interact with an AI assistant, they are primarily interacting with this assistant persona. The behavior of the resulting AI assistant can then be understood largely via the traits of the assistant persona.
This general idea is not unique to us.
Our goal in this post is to articulate and name the idea, discuss empirical evidence for it, and reflect on its consequences for AI development.
In the remainder of this post, we will:

1. Describe the persona selection model (PSM) and supporting evidence. For instance, we argue that PSM explains various surprising results in the generalization and interpretability literatures.
2. Reflect on the consequences of PSM for AI development. Insofar as PSM is a good model of AI-assistant behavior, it has some surprising consequences. For instance, PSM recommends anthropomorphic reasoning about AI assistants and the introduction of pre-training data representing positive AI archetypes.
3. Ask how exhaustive PSM is as a model of AI assistant behavior. Does understanding the assistant persona tell us everything we'd like to know? We sketch a spectrum of views on this question, ranging from the popular "masked shoggoth" picture, in which an outer agent can puppet the assistant toward its own ends, to an opposite perspective in which the post-trained LLM is like a neutral operating system running a simulation that the assistant lives within. We also discuss some relevant empirical observations and conceptual reasons that PSM may or may not be exhaustive, and we speculate about how this might change in the future.