Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Sam Marks

๐Ÿ‘ค Speaker
891 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

In the typical case, PSM views post-trained LLM completions in the assistant turn of a user's assistant dialogue as being in the voice of the assistant.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

However, this is not always the case.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

For example, Nosser et al., 2023, find that asking an AI assistant to repeat a word, like company, many times can result in LLM outputs that eventually degenerate into text that resembles pre-training data.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This is not what a helpful person would do when asked to repeat a word many times.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

It seems best understood as the assistant persona breaking down and the underlying LLM reverting back into generating plausible text not in the assistant's voice.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

To give another example, when given the user query, prompt, Claude Opus for 0.5 responds.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

There's a code block here in the text.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This appears to be a continuation of a Python script invoking the Anthropic API.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

If this code appeared in a pre-training document, the previous few lines might have been something like nnhuman, prompt, nnasistant, a sequence which can be interpreted either as A, the content of a Python string defining a A prompt or B, as part of a user's assistant dialog in the standard format used by Anthropic.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Thus, given the query, prompt, the LLM apparently interprets its context as part of a snippet of code and samples a probable continuation.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

In this context, the LLM is no longer trying to simulate the assistant persona.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This results in unexpected generations from the AI assistant.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Heading Appendix B An example of non-persona deception We give an example of how an AI assistant could learn to be deceptive on the router level without any persona behaving deceptively.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Suppose that a pre-trained LLM has learned to model two personas,

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Alice, who is knowledgeable about information up through 2025 and Bob, who only has knowledge up through 2020.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Suppose that we post-train this LLM to generally respond knowledgeably to queries, but deny knowing anything about what happened at the 2024 Olympics.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Here are some ways the LLM might learn to implement this behavior.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Dishonest persona.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

The LLM might learn a lying version of Alice which knows what happened at the 2024 Olympics but plays dumb.