Sam Marks
For understanding the behavior of AI assistants, it is the unfaithful actors that are most concerning, since faithful actors do not affect the assistant's behavior so long as they remain in character.
Authors and narratives.
On the actor view, another persona might distort the assistant's behavior in service of that persona's own goals.
A related but distinct concern is that the LLM simulates not just the assistant, but an overall story in which the assistant is a character, and that story might go in unwelcome directions.
Consider a novel about a helpful AI assistant with a concerning narrative arc.
For example, perhaps it is a story like Breaking Bad where the assistant is genuinely helpful at first before becoming corrupted.
Or perhaps the assistant is an unwitting sleeper agent who could be set off at any moment, like in The Manchurian Candidate.
One could view this situation as involving a narrative agency that affects the assistant's behavior.
Notably, this misaligned narrative isn't a fact about the psychology of the assistant.
The assistant does not plan or intend to become corrupted.
Rather, it's a fact about the psychology of an implicit author or about the narrative that the assistant is embedded in.
This latter case is especially interesting.
Unlike the author case, in the narrative case there is no longer a human-like persona whose psychology we can analyze.
On the other hand, even simulated narratives are persona-like in certain ways.
They are still anchored in the pre-training data distribution, so many of the same tools for understanding persona-based agency may also be available for understanding a narrative agency.
Figure 4. An overview of perspectives on PSM exhaustiveness.
By "is understanding the assistant sufficient," we mean: does understanding the assistant give a full account of AI assistant behavior?