Sam Marks
๐ค SpeakerAppearances Over Time
Podcast Appearances
In the typical case, PSM views post-trained LLM completions in the assistant turn of a user's assistant dialogue as being in the voice of the assistant.
However, this is not always the case.
For example, Nosser et al., 2023, find that asking an AI assistant to repeat a word, like company, many times can result in LLM outputs that eventually degenerate into text that resembles pre-training data.
This is not what a helpful person would do when asked to repeat a word many times.
It seems best understood as the assistant persona breaking down and the underlying LLM reverting back into generating plausible text not in the assistant's voice.
To give another example, when given the user query, prompt, Claude Opus for 0.5 responds.
There's a code block here in the text.
This appears to be a continuation of a Python script invoking the Anthropic API.
If this code appeared in a pre-training document, the previous few lines might have been something like nnhuman, prompt, nnasistant, a sequence which can be interpreted either as A, the content of a Python string defining a A prompt or B, as part of a user's assistant dialog in the standard format used by Anthropic.
Thus, given the query, prompt, the LLM apparently interprets its context as part of a snippet of code and samples a probable continuation.
In this context, the LLM is no longer trying to simulate the assistant persona.
This results in unexpected generations from the AI assistant.
Heading Appendix B An example of non-persona deception We give an example of how an AI assistant could learn to be deceptive on the router level without any persona behaving deceptively.
Suppose that a pre-trained LLM has learned to model two personas,
Alice, who is knowledgeable about information up through 2025 and Bob, who only has knowledge up through 2020.
Suppose that we post-train this LLM to generally respond knowledgeably to queries, but deny knowing anything about what happened at the 2024 Olympics.
Here are some ways the LLM might learn to implement this behavior.
1.
Dishonest persona.
The LLM might learn a lying version of Alice which knows what happened at the 2024 Olympics but plays dumb.