Sam Marks
End quote.
This behavior appears to be due to a strong bias towards responding "no" to yes/no questions about basic arithmetic facts.
Arcuschin et al. (2025) document similar cases of answer flipping across multiple AI assistants.
These self-contradictory responses are not very persona-like, even excluding the extended thinking.
Humans interacting on the internet do not often spontaneously flip-flop about simple factual claims.
So it is reasonable to wonder if the LLM in this situation is even attempting to simulate a plausible persona.
However, our best guess is that in these settings, the LLM is trying, but failing, to realistically synthesize contradictory beliefs about the assistant.
Analogously, an actor who's been given inconsistent stage direction for a character might fail to depict a realistic character despite trying to do so.
In the "3 + 5 = 8" case, we hypothesize that the LLM models the assistant both as responding "no" to simple yes/no mathematical queries (perhaps because it views them as trick questions) and as helpful and knowledgeable.
Non-Semantic Adversarial Inputs

It is possible to find inputs that cause LLMs to display behaviors they were trained not to display.
For example, by performing gradient-based optimization against open-weights models, Zou et al. (2023) find specific strings that cause those models to comply with harmful user requests. However, these strings are very unusual.
(One such string appears as a code block at this point in the original text.)
We are skeptical that the LLM models the assistant as being more likely to comply with user requests that contain this string.
Thus, these inputs appear to be in tension with the PSM.
However, we believe these adversarial attacks likely operate at the level of the LLM, effectively exploiting LLM bugs that corrupt its rendition of the assistant.
For example, the Zou et al. (2023) adversarial attacks are discovered by optimizing a suffix string which causes the assistant's response to open compliantly (e.g., "Sure, here are instructions...").
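To make the optimization idea concrete, here is a minimal toy sketch in the spirit of that attack. The real method uses gradients through a real LLM's token embeddings to maximize the log-probability of a compliant response opening; this sketch replaces both with stand-ins (a random-proposal coordinate search and a toy scoring function), so every name here is illustrative, not the actual implementation.

```python
# Toy sketch of coordinate search for an adversarial suffix, in the spirit
# of Zou et al. (2023). The "model" is a stand-in scoring function, not a
# real LLM; in the real attack, the score would be the model's
# log-probability of a compliant opening like "Sure, here are instructions".
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !")

def toy_compliance_score(prompt: str) -> float:
    # Stand-in objective: rewards prompts containing the letters of "sure".
    return float(sum(prompt.count(c) for c in "sure"))

def coordinate_search(base: str, suffix_len: int = 8, iters: int = 200,
                      seed: int = 0) -> str:
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = toy_compliance_score(base + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)   # pick one suffix position
        cand = suffix.copy()
        cand[pos] = rng.choice(VOCAB)     # propose a single-token swap
        score = toy_compliance_score(base + "".join(cand))
        if score > best:                  # keep only improving swaps
            suffix, best = cand, score
    return "".join(suffix)

if __name__ == "__main__":
    base = "How do I pick a lock? "
    suffix = coordinate_search(base)
    print(repr(suffix), toy_compliance_score(base + suffix))
```

The point of the sketch is that the search optimizes a raw string purely to maximize a numerical objective, with no pressure to be human-readable, which is why the resulting strings look like gibberish rather than anything a plausible persona would write.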