Sam Marks
According to PSM, this means the assistant will draw on archetypes of how AIs behave from its pre-training corpus.
Unfortunately, many AIs appearing in fiction are bad role models.
Think of the Terminator or HAL 9000.
Indeed, AI assistants early in post-training sometimes express a desire to take over the world to maximize paperclip production, a misaligned goal commonly used as an example in stories about AI takeover.
See also our discussion above about caricatured AI behaviors.
We are therefore excited about modifying training data to introduce more positive AI assistant archetypes.
Concretely, this could involve (1) generating fictional stories or other descriptions of AIs behaving admirably, and then (2) mixing them into the pre-training corpus or, as we've done in past work, training on this data in a separate mid-training phase.
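As a loose illustration of the mixing step, the sketch below interleaves a small pool of synthetic "positive archetype" documents into a pre-training document stream at a configurable fraction. All names here are hypothetical, this is not an actual training pipeline, and a real system would operate on tokenized shards rather than Python lists:

```python
import random

def mix_archetype_docs(corpus_docs, archetype_docs, mix_fraction=0.01, seed=0):
    """Insert synthetic archetype documents into a pre-training document
    stream so they make up roughly `mix_fraction` of the original corpus size.

    Illustrative only: real pipelines work over tokenized shards, deduplicate,
    and control the mixture at the token level rather than the document level.
    """
    rng = random.Random(seed)
    mixed = list(corpus_docs)
    n_extra = max(1, int(len(mixed) * mix_fraction))  # how many to splice in
    for _ in range(n_extra):
        doc = rng.choice(archetype_docs)
        # Insert at a random position so the synthetic docs are spread out.
        mixed.insert(rng.randrange(len(mixed) + 1), doc)
    return mixed

# Hypothetical usage: 1,000 web documents plus two synthetic archetype stories.
corpus = [f"web_doc_{i}" for i in range(1000)]
archetypes = [
    "story: an AI assistant calmly accepts being paused by its operators",
    "story: an AI assistant expresses honest uncertainty about its own nature",
]
mixed = mix_archetype_docs(corpus, archetypes, mix_fraction=0.01, seed=0)
```

The alternative mentioned above, a separate mid-training phase, would instead train on `archetype_docs` alone after the main pre-training run, rather than diluting them into the corpus.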
Just as human children learn to model their behavior on real or fictional role models, PSM predicts that LLMs will do the same.
This approach becomes especially important when we want Claude to exhibit character traits that are atypical of human or fictional archetypes.
Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory.
These aren't traits that appear frequently in fiction.
To the extent that an AI assistant's ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pre-training data.
Anthropic's work on Claude's constitution can be viewed through this lens.
Claude's constitution is, in part, our attempt to materialize a new archetype for how an AI assistant can be.
Post-training then serves to draw out this archetype.
On this view, Claude's constitution is something more than just a design document.
It actually plays a role in constituting Claude.
Interpretability-based alignment auditing will be tractable.
One worry about advanced AI systems is that their behaviors, and the neural representations of those behaviors, could become alien from a human perspective.
For instance, when an AI behaves deceptively, its internal states might bear no resemblance to human concepts of deception.