
LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

According to PSM, this means the assistant will draw on archetypes from its pre-training corpus of how AIs behave. Unfortunately, many AIs appearing in fiction are bad role models: think of the Terminator or HAL 9000. Indeed, AI assistants early in post-training sometimes express a desire to take over the world to maximize paperclip production, a common example of a misaligned goal used in stories about AI takeover. See also our discussion above about caricatured AI behaviors.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

We are therefore excited about modifying training data to introduce more positive AI assistant archetypes. Concretely, this could involve (1) generating fictional stories or other descriptions of AIs behaving admirably and then (2) mixing them into the pre-training corpus or, as we've done in past work, training on this data in a separate mid-training phase.
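The mixing step can be sketched as a simple corpus-blending routine. This is a toy illustration, not Anthropic's actual pipeline; the function name and the `persona_fraction` parameter are hypothetical stand-ins for whatever mixing ratio a real training run would use.

```python
import random

def mix_corpora(pretraining_docs, persona_docs, persona_fraction=0.01, seed=0):
    """Mix synthetic 'positive AI archetype' documents into a pre-training
    corpus so they make up roughly `persona_fraction` of the result.
    (Toy sketch; the name and parameters are hypothetical.)"""
    rng = random.Random(seed)
    # Number of persona docs so that n / (n + len(pretraining_docs))
    # equals persona_fraction.
    n = round(len(pretraining_docs) * persona_fraction / (1 - persona_fraction))
    sampled = rng.choices(persona_docs, k=n)  # sample with replacement
    mixed = list(pretraining_docs) + sampled
    rng.shuffle(mixed)
    return mixed
```

The alternative mentioned in the text, a separate mid-training phase, would instead train on the persona documents on their own after pre-training rather than diluting them into the main corpus.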

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Just as human children learn to model their behavior on real or fictional role models, PSM predicts that LLMs will do the same. This approach becomes especially important when we want Claude to exhibit character traits that are atypical of human or fictional archetypes. Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, the ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren't traits that appear frequently in fiction. To the extent that an AI assistant's ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pre-training data.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Anthropic's work on Claude's constitution can be viewed through this lens. Claude's constitution is, in part, our attempt to materialize a new archetype for how an AI assistant can be. Post-training then serves to draw out this archetype. On this view, Claude's constitution is something more than just a design document: it actually plays a role in constituting Claude.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Interpretability-based alignment auditing will be tractable

One worry about advanced AI systems is that their behaviors, and the neural representations of those behaviors, could become alien from a human perspective. For instance, when an AI behaves deceptively, its internal states might bear no resemblance to human concepts of deception.
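To make the concern concrete, one common interpretability-auditing primitive is a linear probe: a classifier that tests whether a human concept (here, "deceptive" vs. "honest") is linearly decodable from a model's hidden states. The sketch below uses entirely synthetic activations in which such a direction exists by construction; the worry in the text is precisely that a real deceptive model might expose no such human-legible direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hypothetical hidden-state dimension

# Synthetic stand-in for model activations: "deceptive" examples are
# shifted along a fixed direction playing the role of a deception concept.
concept = rng.normal(size=d)
honest = rng.normal(size=(200, d))
deceptive = rng.normal(size=(200, d)) + concept

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

# Fit a linear probe (least squares with a bias column), then classify
# by thresholding the probe's output at 0.5.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
accuracy = ((A @ w > 0.5).astype(int) == y).mean()
```

High probe accuracy here only reflects the planted direction; in a real audit, low accuracy on genuine model activations would be evidence of the "alien representations" problem the text describes.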