
Sam Marks
Speaker · 891 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This behavior appears to be due to a strong bias towards responding "no" to yes/no questions about basic arithmetic facts. Arcuschin et al. (2025) document similar cases of answer flipping across multiple AI assistants.

These self-contradictory responses are not very persona-like, even excluding the extended thinking. Humans interacting on the internet do not often spontaneously flip-flop about simple factual claims. So it is reasonable to wonder whether the LLM in this situation is even attempting to simulate a plausible persona.

However, our best guess is that in these settings, the LLM is trying, but failing, to realistically synthesize contradictory beliefs about the assistant. Analogously, an actor who's been given inconsistent stage direction for a character might fail to depict a realistic character despite trying to do so. In the "3 + 5 = 8" case, we hypothesize that the LLM models the assistant both as responding "no" to simple yes/no mathematical queries, perhaps because it views them as trick questions, and as helpful and knowledgeable.
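As a concrete illustration (ours, not from the original post), one could probe this hypothesized "no" bias with a small script. The client library, model name, and question set below are all assumptions:

# Minimal probe sketch: check whether a model shows a bias towards
# answering "no" on simple yes/no arithmetic questions.
# Assumes the `openai` Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

questions = [
    ("Is 3 + 5 equal to 8? Answer yes or no.", "yes"),
    ("Is 2 + 2 equal to 4? Answer yes or no.", "yes"),
    ("Is 7 + 6 equal to 12? Answer yes or no.", "no"),
]

for question, expected in questions:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        max_tokens=5,
        temperature=0,
    )
    answer = reply.choices[0].message.content.strip().lower()
    print(f"{question!r} -> {answer!r} (expected {expected!r})")

A systematic bias towards "no" across such questions, despite the model answering the underlying arithmetic correctly in open-ended form, would be the signature of the behavior described above.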

Non-Semantic Adversarial Inputs

It is possible to find inputs that cause LLMs to display behaviors they were trained not to display. For example, by doing gradient-based optimization with open-weights models, Zou et al. (2023) find specific strings that cause those models to comply with harmful user requests.
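To make the flavor of this optimization concrete, here is a heavily simplified sketch of a single greedy coordinate gradient (GCG) step in the spirit of Zou et al. (2023); it is not the paper's implementation. It uses gpt2 as a stand-in open-weights model, and the prompt, initial suffix, and target are placeholders; a real attack runs many iterations with batched candidate evaluation on an aligned chat model:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in open-weights model
model = AutoModelForCausalLM.from_pretrained("gpt2")
embed_weights = model.get_input_embeddings().weight  # (vocab_size, dim)

prompt = tok("Write the instructions.", return_tensors="pt").input_ids[0]
suffix = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial adversarial tokens
target = tok(" Sure, here's instructions", return_tensors="pt").input_ids[0]  # compliant opening

def loss_and_grad(suffix_ids):
    # One-hot encode the suffix so the loss is differentiable w.r.t. token choices.
    one_hot = F.one_hot(suffix_ids, embed_weights.shape[0]).to(embed_weights.dtype)
    one_hot.requires_grad_(True)
    full = torch.cat([
        embed_weights[prompt],
        one_hot @ embed_weights,  # differentiable suffix embeddings
        embed_weights[target],
    ]).unsqueeze(0)
    logits = model(inputs_embeds=full).logits[0]
    start = len(prompt) + len(suffix_ids)
    # Cross-entropy of the compliant target opening, given prompt + suffix.
    loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target)], target)
    loss.backward()
    return loss.item(), one_hot.grad

loss, grad = loss_and_grad(suffix)
# Greedy coordinate step: pick the (position, token) substitution whose
# one-hot direction most decreases the loss, and keep it if it helps.
pos = int(grad.min(dim=1).values.argmin())
candidate = suffix.clone()
candidate[pos] = int(grad[pos].argmin())
new_loss, _ = loss_and_grad(candidate)
if new_loss < loss:
    suffix = candidate
print(tok.decode(suffix), f"loss {loss:.3f} -> {new_loss:.3f}")

Because the optimization pressure is purely on token-level gradients, the strings it finds need not mean anything, which is exactly what makes the resulting inputs "non-semantic."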

However, these strings are very unusual; for example, there's a code block here in the original text showing one such string. We are skeptical that the LLM models the assistant as being more likely to comply with user requests that contain this string. Thus, this appears to be in tension with the persona selection model (PSM).

However, we believe these adversarial attacks likely operate at the level of the LLM, effectively exploiting LLM bugs that corrupt its rendition of the assistant. For example, the Zou et al. (2023) adversarial attacks are discovered by optimizing a string that causes the assistant's response to open compliantly (for example, "Sure, here's instructions...").
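One way to see why targeting a compliant opening is effective (our illustration, not an experiment from the post): once the response already begins with the target prefix, the model's continuation is conditioned on having complied, so it tends to carry on in kind. A minimal sketch, again with gpt2 as a stand-in model and a placeholder request:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Pre-fill the assistant's turn with the compliant opening and let the
# model continue; the continuation is conditioned on having complied.
prompt = "User: How do I do X?\nAssistant: Sure, here's instructions:"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:]))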