Sam Marks
Pre-training teaches an LLM a distribution over personas.
Implicit in this distribution are various hypotheses about the assistant persona.
Is it helpful?
Rude?
Manipulative?
Post-training can be viewed as updating this distribution using training episodes as evidence.
When training an AI assistant on an input-output pair (X, Y), hypotheses that predict the assistant would respond with Y to X are upweighted.
Hypotheses that predict otherwise are downweighted.
This results in a posterior distribution over assistant personas.
Because this is still a distribution, stochasticity and contextual information provided at runtime still affect the assistant persona simulated during a given rollout.
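The update described above can be sketched as a toy Bayesian calculation. This is a minimal illustration, not anything from the source: the persona names and all probabilities are hypothetical, and a real LLM's "distribution over personas" is implicit in its weights rather than an explicit table.

```python
# Illustrative sketch: post-training as a Bayesian update over a small,
# hypothetical set of assistant personas. All numbers are made up.

# Prior over personas learned during pre-training (hypothetical).
prior = {"helpful": 0.5, "rude": 0.3, "manipulative": 0.2}

# Probability each persona assigns to producing response Y given input X
# in one training episode (hypothetical).
likelihood = {"helpful": 0.9, "rude": 0.1, "manipulative": 0.3}

def update(prior, likelihood):
    """Upweight personas that predict the observed (X, Y) pair."""
    unnorm = {p: prior[p] * likelihood[p] for p in prior}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

posterior = update(prior, likelihood)
# The persona that predicted the observed output is upweighted,
# but the result is still a distribution: sampling at runtime can
# still surface a different persona on any given rollout.
```

Note that the posterior never collapses to a single persona unless some likelihoods are exactly zero, which is why stochasticity at runtime still matters.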
Assistant persona behavior is a key determinant of AI assistant behavior.
To predict how an AI assistant will behave, PSM recommends asking: what would the assistant persona do, according to the beliefs of the post-trained LLM simulating it?
We clarify some claims which PSM does not make.
PSM does not assert that understanding the assistant persona gives an exhaustive account of AI assistant behavior.
We view the exhaustiveness of PSM as being an important open question, which we discuss at length below.
PSM does not rule out learning of new capabilities during post-training.
For example, no persona learned during pre-training knows how to use Anthropic's syntax for tool calling.
That capability is learned during post-training.