Sam Marks
๐ค SpeakerAppearances Over Time
Podcast Appearances
That is, understanding, the LLMs model of, the assistant psychology is predictive of how the assistant will act in unseen situations.
For example, by understanding that Claude, by which we mean the assistant persona underlying the Claude AI assistant, has a preference against answering harmful queries, we can predict that Claude will have other downstream preferences, such as not wanting to be retrained to comply with harmful requests.
The second reason is more subtle.
Whereas the first reason pertain to understanding the psychology of a fixed assistant persona, PSM also recommends anthropomorphic reasoning about how training modifies the assistant.
Suppose we have a training input X and we would like to decide how to evaluate a candidate AI assistant output Y. Here are two different questions we could ask to analyze how good of a response Y is.
Is Y the way we want the LLM to respond to X?
If we learned that a person responded to X with Y, what sort of a person would we think they are?
PSM recommends asking the latter question.
This often requires anthropomorphic reasoning about how AI assistants will learn from their training data, not unlike how parents, teachers, developmental psychologists, etc.
reason about human children.
Below are some notable examples.
Inoculation prompting.
If we praise a child for bullying, they learn to be a bully.
But if we praise a child for playing a bully in a school play, they will learn to be a good actor.
This is true even though the actions the child performs might be superficially very similar.
It's clear from context which behavior is being reinforced.
It is the same with inoculation prompting.
By changing the context of a training episode, we change what it implies about the assistant's character.
Producing insecure code when asked to is consistent with being helpful.
Producing it unprompted is evidence of malice.