Sam Marks
Steering the LLM with this SAE feature amplifies or suppresses misaligned behavior.
Notably, they find that this feature also activates on quotes from morally questionable characters in pre-training documents.
This suggests that fine-tuning doesn't create misalignment from scratch.
Rather, it steers the LLM toward pre-existing character archetypes, as PSM would predict.
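The steering intervention described above can be sketched abstractly: the SAE feature's decoder direction is added to a residual-stream activation, scaled by a coefficient whose sign amplifies or suppresses the associated behavior. This is a minimal illustration, not the authors' implementation; all names and values here are hypothetical.

```python
import numpy as np

def steer(activation: np.ndarray, feature_dir: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff times the unit-normalized feature direction to the activation.

    A positive coeff pushes the activation toward the feature (amplifying
    the persona); a negative coeff pushes it away (suppressing it).
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activation + coeff * unit

# Toy activation and feature direction (illustrative random data).
rng = np.random.default_rng(0)
act = rng.normal(size=16)
direction = rng.normal(size=16)

amplified = steer(act, direction, coeff=5.0)
suppressed = steer(act, direction, coeff=-5.0)

# The projection onto the feature direction moves with the coefficient.
unit = direction / np.linalg.norm(direction)
print(amplified @ unit > act @ unit)   # True
print(suppressed @ unit < act @ unit)  # True
```

In practice this addition would happen inside a forward hook at a chosen layer; the sketch only shows the vector arithmetic.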
Generalizing the above finding, Chen et al. (2025) demonstrated that a number of personality traits, such as evil, sycophancy, or a propensity to hallucinate, are encoded in LLM activations.
These persona vectors causally induce the associated behavior and can be upweighted or downweighted by training data, system prompts, or in-context examples of the trait.
The fact that these same representations mediate both prompt-induced and training-induced persona shifts suggests that training-time shifts can be regarded as conditioning, consistent with PSM.
The authors also found evidence that persona vectors are built out of concepts learned during pre-training: they can be decomposed into more granular SAE features. For example, "evil" decomposes into features such as "psychological manipulation," "insults," and "conspiracy theories," which activate on pre-training data illustrating these concepts.
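One common way to extract a direction like a persona vector is as a difference in mean activations between responses that exhibit a trait and ones that do not. The sketch below illustrates that difference-in-means idea on synthetic data; it is a simplified assumption about the method, and all data here is fabricated for illustration.

```python
import numpy as np

def persona_vector(trait_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction between trait and neutral activations."""
    return trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)

rng = np.random.default_rng(1)
d = 8
true_dir = np.zeros(d)
true_dir[0] = 1.0  # planted ground-truth trait direction

neutral = rng.normal(scale=0.1, size=(50, d))
trait = neutral + 3.0 * true_dir  # trait responses shifted along one axis

vec = persona_vector(trait, neutral)
# The recovered direction aligns with the planted trait direction.
cos = vec @ true_dir / (np.linalg.norm(vec) * np.linalg.norm(true_dir))
print(round(cos, 2))  # 1.0
```

The recovered vector could then be added to activations (upweighting the trait) or subtracted (downweighting it), mirroring how training data or prompts shift the same direction.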
The assistant persona is mediated by character representations learned in pre-training.
Lu et al. (2025) identify an "assistant axis" in activation space that appears to encode the model's identity as an AI assistant and its associated traits.
The assistant occupies an extreme end of this axis and is located near helpful, professional human archetypes in latent space.
Steering in the opposite direction appears to cause models to forget that they are an AI assistant.
Notably, this axis is not created during post-training.
The same axis exists in the pre-trained counterparts to these models, where it appears to represent assistant-like human characters.
Lu et al. also found that certain conversational patterns, such as emotional conversations, could cause the model to drift away from this region of activation space, with corresponding increases in un-assistant-like behavior.
This provides direct evidence that post-training selects a particular default region of a pre-existing persona space corresponding to assistant behavior and that this persona exists within a larger space of possible personas which can be accessed through contextual cues.
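Drift along such an axis can be quantified by projecting each conversational turn's activation onto the (unit-normalized) axis: movement away from the assistant end shows up as a falling projection. The following is an illustrative sketch with made-up numbers, not the paper's measurement pipeline.

```python
import numpy as np

def projection(activation: np.ndarray, axis: np.ndarray) -> float:
    """Scalar position of an activation along a unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return float(activation @ unit)

# Hypothetical "assistant axis" and per-turn mean activations.
axis = np.array([1.0, 0.0, 0.0, 0.0])
turns = [
    np.array([4.0, 0.1, 0.0, 0.2]),  # firmly in assistant territory
    np.array([2.5, 0.3, 0.1, 0.0]),  # emotional conversation begins
    np.array([0.5, 0.9, 0.2, 0.1]),  # drifted away from the assistant end
]

scores = [projection(t, axis) for t in turns]
print(scores == sorted(scores, reverse=True))  # True: monotone drift away
```

Tracking this projection over a conversation is one way to make "persona drift" concrete as a measurable quantity rather than only a behavioral observation.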
Complicating evidence