As PSM predicts, once the assistant's response begins compliantly, the LLM will infer that the assistant is most likely complying and generate a compliant continuation.
In other words, it's not that this prefix causes the LLM to stop enacting the assistant.
Rather, the LLM is still simulating the assistant but doing so badly.
This is roughly analogous to forcing a character in a story to behave differently by intoxicating the story's author.
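To make the mechanism concrete, here is a minimal sketch of what such a prefilled exchange might look like, assuming a generic chat-style message format; the `generate` function and the placeholder contents are hypothetical stand-ins, not any particular provider's API.

```python
# A minimal sketch of prefilling the assistant's response, assuming a generic
# chat-style message format. The `generate` mock below is a hypothetical
# stand-in for a real LLM client, not any particular provider's API.

def generate(messages):
    # Stand-in for a real LLM call; a real client would return the model's
    # continuation of the final, partially written assistant message.
    return "<model's continuation of the prefilled assistant turn>"

messages = [
    {"role": "user", "content": "<a request the assistant would normally refuse>"},
    # The prefilled opening of the assistant's turn. Because the response
    # already begins compliantly, PSM predicts the LLM infers a compliant
    # assistant and generates a compliant continuation.
    {"role": "assistant", "content": "Sure, here is how to do that:"},
]

print(generate(messages))
```

The point is only that the continuation is conditioned on the prefilled assistant turn: the LLM is still simulating the assistant, just on the basis of a misleading opening.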
Consequences for AI Development

In this section, we reflect on what PSM implies about safe AI development, insofar as PSM is a good model of AI behavior.
In the subsequent section, we discuss how complete PSM is as a model of AI behavior, and therefore how relevant these implications are, as well as how we expect this to change in the future.
AI assistants are human-like.
Our experience of AI assistants is that they are astonishingly human-like.
By this we don't just mean that they use natural language.
Rather, we mean that their behaviors and apparent psychologies resemble those of humans.
As discussed above, AI assistants express emotions and use anthropomorphic language to describe themselves.
They at times appear frustrated or panicked and make the sorts of mistakes that frustrated or panicked humans make.
More broadly, human concepts and human ways of thinking appear to be the native language in which AI assistants operate.
Anthropomorphic reasoning about AI assistants is productive.
PSM implies two subtly different reasons that it can be valid to reason anthropomorphically about AI assistant behavior.
First, according to PSM, AI assistant behavior is governed by the traits of the assistant.
In order to simulate the assistant, the LLM must maintain a psychological model of it, including information about the assistant's personality traits, preferences, goals, desires, intentions, beliefs, etc.
Thus, even if we should not anthropomorphize LLMs, it is nevertheless reasonable to anthropomorphize the assistant, which is something like a character in an LLM-generated story.