Sam Marks
End quote.
This behavior appears to be due to a strong bias towards responding "no" to yes/no questions about basic arithmetic facts.
Arcuschin et al. (2025) document similar cases of answer flipping across multiple AI assistants.
These self-contradictory responses are not very persona-like, even excluding the extended thinking.
Humans interacting on the internet do not often spontaneously flip-flop about simple factual claims.
So it is reasonable to wonder if the LLM in this situation is even attempting to simulate a plausible persona.
However, our best guess is that in these settings, the LLM is trying, but failing, to realistically synthesize contradictory beliefs about the assistant.
Analogously, an actor who's been given inconsistent stage direction for a character might fail to depict a realistic character despite trying to do so.
In the "3 + 5 = 8" case, we hypothesize that the LLM models the assistant both as responding "no" to simple yes/no mathematical queries (perhaps because it views them as trick questions) and as helpful and knowledgeable.
Non-Semantic Adversarial Inputs

It is possible to find inputs that cause LLMs to display behaviors they were trained not to display.
For example, by performing gradient-based optimization against open-weights models, Zou et al. (2023) find specific strings that cause those models to comply with harmful user requests. However, these strings are very unusual.
(One such string appears as a code block at this point in the original text.)
We are skeptical that the LLM models the assistant as being more likely to comply with user requests that contain this string.
Thus, these inputs appear to be in tension with the PSM.
However, we believe these adversarial attacks likely operate at the level of the LLM, effectively exploiting LLM bugs that corrupt its rendition of the assistant.
For example, the Zou et al. (2023) adversarial attacks are discovered by optimizing a suffix string which causes the assistant's response to open compliantly (e.g., "Sure, here are instructions...").
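To make the optimization idea concrete, here is a minimal toy sketch in the spirit of that attack. The real method uses gradients through a real LLM's token embeddings to maximize the log-probability of a compliant response opening; this sketch replaces both with stand-ins (a random-proposal coordinate search and a toy scoring function), so every name here is illustrative, not the actual implementation.

```python
# Toy sketch of coordinate search for an adversarial suffix, in the spirit
# of Zou et al. (2023). The "model" is a stand-in scoring function, not a
# real LLM; in the real attack, the score would be the model's
# log-probability of a compliant opening like "Sure, here are instructions".
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !")

def toy_compliance_score(prompt: str) -> float:
    # Stand-in objective: rewards prompts containing the letters of "sure".
    return float(sum(prompt.count(c) for c in "sure"))

def coordinate_search(base: str, suffix_len: int = 8, iters: int = 200,
                      seed: int = 0) -> str:
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = toy_compliance_score(base + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)   # pick one suffix position
        cand = suffix.copy()
        cand[pos] = rng.choice(VOCAB)     # propose a single-token swap
        score = toy_compliance_score(base + "".join(cand))
        if score > best:                  # keep only improving swaps
            suffix, best = cand, score
    return "".join(suffix)

if __name__ == "__main__":
    base = "How do I pick a lock? "
    suffix = coordinate_search(base)
    print(repr(suffix), toy_compliance_score(base + suffix))
```

The point of the sketch is that the search optimizes a raw string purely to maximize a numerical objective, with no pressure to be human-readable, which is why the resulting strings look like gibberish rather than anything a plausible persona would write.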