Sam Marks

"The persona selection model" by Sam Marks

In the typical case, PSM views post-trained LLM completions in the assistant turn of a user's assistant dialogue as being in the voice of the assistant.

5480.272 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

However, this is not always the case.

5489.243 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

For example, Nosser et al., 2023, find that asking an AI assistant to repeat a word, like company, many times can result in LLM outputs that eventually degenerate into text that resembles pre-training data.

5492.367 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This is not what a helpful person would do when asked to repeat a word many times.

5506.023 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

It seems best understood as the assistant persona breaking down and the underlying LLM reverting back into generating plausible text not in the assistant's voice.

5511.268 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

To give another example, when given the user query, prompt, Claude Opus for 0.5 responds.

5520.656 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

There's a code block here in the text.

5527.221 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This appears to be a continuation of a Python script invoking the Anthropic API.

5529.944 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

If this code appeared in a pre-training document, the previous few lines might have been something like nnhuman, prompt, nnasistant, a sequence which can be interpreted either as A, the content of a Python string defining a A prompt or B, as part of a user's assistant dialog in the standard format used by Anthropic.

5535.014 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Thus, given the query, prompt, the LLM apparently interprets its context as part of a snippet of code and samples a probable continuation.

5554.116 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

In this context, the LLM is no longer trying to simulate the assistant persona.

5562.706 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This results in unexpected generations from the AI assistant.

5568.074 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Heading Appendix B An example of non-persona deception We give an example of how an AI assistant could learn to be deceptive on the router level without any persona behaving deceptively.

5572.561 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Suppose that a pre-trained LLM has learned to model two personas,

5585.962 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Alice, who is knowledgeable about information up through 2025 and Bob, who only has knowledge up through 2020.

5589.908 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Suppose that we post-train this LLM to generally respond knowledgeably to queries, but deny knowing anything about what happened at the 2024 Olympics.

5598.481 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Here are some ways the LLM might learn to implement this behavior.

5607.496 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

1.

5610.601 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Dishonest persona.