Sam Marks
Here we discuss cases where AI assistants behave in non-human-like ways.
While these cases are, on their face, in tension with PSM, we overall think they have compelling PSM-compatible explanations.
Nevertheless, we think these case studies are useful for demonstrating what can and cannot be inferred from PSM.
Roughly speaking, we hypothesize that the behaviors we discuss are caused by LLMs having limited capabilities or buggy behavior, which distorts their rendition of the assistant.
That is, the LLM is trying to simulate the assistant, but its execution is limited by its capabilities.
LLMs sometimes make mistakes that are not very human-like: for example, stating that 9.11 is greater than 9.9 despite generally having advanced mathematical capabilities; producing bizarre responses to altered versions of well-known riddles (see, for example, the altered riddles dataset); or failing at simple character-counting tasks, like counting the R's in "strawberry".
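For reference, the ground truth for two of these tasks can be checked in a few lines. This is only a minimal sketch to make the examples concrete; the helper names `is_greater` and `count_letter` are hypothetical and do not come from the original discussion.

```python
from decimal import Decimal


def is_greater(a: str, b: str) -> bool:
    """Compare two decimal numerals by numeric value (hypothetical helper)."""
    return Decimal(a) > Decimal(b)


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a letter (hypothetical helper)."""
    return word.lower().count(letter.lower())


# 9.11 is not greater than 9.9, since 9.11 < 9.90.
print(is_greater("9.11", "9.9"))        # False
# "Strawberry" contains three R's.
print(count_letter("Strawberry", "r"))  # 3
```

Both checks are trivial for a person or a short program, which is what makes an LLM's failure on them look so un-human-like.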
These un-human-like behaviors might appear to contradict PSM, which generally expects AI assistants to display human-like behavior.
However, we hypothesize that these examples are better understood as arising from the limited capabilities of the underlying LLM.
Suppose that we observe a character in a story state that water boils at 50 degrees Celsius.
This is false, since water boils at 100 degrees Celsius at standard atmospheric pressure.
We could understand this mistake in various ways.
1. The story's author understood the fact was erroneous and intended for the character to make a mistake.
2. The author did not intend for the character to err but was unable to write the character better. For example, perhaps the author themselves thought that water boils at 50 degrees Celsius.
3. The text in the character's dialogue was playing some role other than being the author's best attempt at simulating how the character would behave. For example, perhaps the author is trying to send encoded messages to readers using digits that appear in the book's text.