For instance, an "inner conflict" SAE feature activates when Claude 3 Sonnet is faced with an ethical dilemma, and also on stories about characters facing ethical dilemmas (Templeton et al., 2024).
A "Holding Back One's True Thoughts" SAE feature activates when Claude Opus 4.5 fails to reveal information it knows, and also on stories about characters concealing their thoughts or feelings (Claude Opus 4.5 System Card, Section 6.4).
A "panic" SAE feature activates in Claude 3.5 Haiku when it is faced with a shutdown threat, and also on narrative descriptions of people panicking (60 Minutes).
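To make these examples concrete, here is a minimal sketch of how one might read off a single SAE feature's activations on a prompt, assuming a TransformerLens-style model and a standard sparse autoencoder parameterization; the names `model`, `sae`, `layer`, and `feature_idx` are illustrative placeholders, not taken from the cited work.

```python
import torch

def feature_activation(model, sae, tokens, feature_idx, layer):
    """Per-token activation of one SAE feature on a prompt.

    Assumes a standard SAE parameterization:
        f = ReLU((x - b_dec) @ W_enc + b_enc)
    All names here (model, sae, layer) are illustrative, not from the papers.
    """
    with torch.no_grad():
        # Cache residual-stream activations; shape (1, seq_len, d_model).
        _, cache = model.run_with_cache(tokens)
        resid = cache[f"blocks.{layer}.hook_resid_post"][0]
        # Encode into the SAE's feature basis.
        hidden = torch.relu((resid - sae.b_dec) @ sae.W_enc + sae.b_enc)
    return hidden[:, feature_idx]  # high values = feature active on that token
```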
These persona representations are also causal determinants of the assistant's behavior.
For instance, Templeton et al. (2024) observe that SAE features representing sycophancy, secrecy, or sarcasm, which are strongly active on pre-training samples in which humans display those traits, induce the corresponding behaviors in the assistant when injected into the LLM's activations.
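As a rough illustration of this kind of causal intervention, the sketch below adds a feature's decoder direction to the residual stream during generation (often called activation steering). The hook point, the `scale` parameter, and the other names are assumptions for illustration, not the exact procedure from Templeton et al. (2024).

```python
def steer_with_feature(model, sae, tokens, feature_idx, layer, scale=5.0):
    """Generate text while injecting one SAE feature's decoder direction.

    A sketch of activation steering: add scale * W_dec[feature_idx] to the
    residual stream at one layer. The layer choice and scale would need
    tuning in practice; all names are illustrative placeholders.
    """
    direction = sae.W_dec[feature_idx]  # (d_model,) decoder direction

    def add_direction(resid, hook):
        # Broadcast the feature direction across batch and sequence positions.
        return resid + scale * direction

    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_direction)]):
        return model.generate(tokens, max_new_tokens=64)
```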
Notably, LLMs also reuse representations related to non-human entities.
For instance, Templeton et al. (2024) observed that features related to chatbots, such as Amazon's Alexa, or to NPCs in video games, are commonly active during user-assistant interactions.
This is still consistent with PSM, but indicates that the space of personas available for selection includes non-human character archetypes, perhaps especially those relating to AI systems.
Caveat: Not all representations in post-trained models are reused from pre-training, as we discuss below.
Additionally, it may be the case that reused representations are systematically more interpretable than representations that are learned from scratch during post-training.
If so, representations accessible to current interpretability research are disproportionately reused.
This would be a form of the streetlight effect, distorting our evidence to be overly supportive of PSM.
Behavioral changes during fine-tuning are mediated by persona representations.
We discussed above cases where the ways LLMs generalize from training data are consistent with PSM.
Studying some of these examples more closely, we find evidence that this generalization is indeed mediated by persona representations formed during pre-training.
For instance, Wang et al. (2025) study emergent misalignment in GPT-4o. They identify "misaligned persona" SAE features whose activity increases in emergently misaligned GPT-4o fine-tunes. One such feature, which they call the "toxic persona" feature, most strongly controls emergent misalignment.
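A simplified sketch of the underlying comparison: collect SAE activations from the base model and a fine-tune on the same prompts, then rank features by how much their mean activity increased. This is an illustrative reconstruction of that kind of analysis, not Wang et al.'s exact method.

```python
import torch

def most_increased_features(acts_base, acts_finetuned, top_k=10):
    """Rank SAE features by mean-activation increase after fine-tuning.

    acts_base, acts_finetuned: (n_tokens, n_features) SAE activations
    collected on the same evaluation prompts. An illustrative sketch,
    not the exact procedure from Wang et al. (2025).
    """
    diff = acts_finetuned.mean(dim=0) - acts_base.mean(dim=0)
    return torch.argsort(diff, descending=True)[:top_k]  # most-increased first
```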