Sam Marks
According to PSM, this means the assistant will draw on archetypes of how AIs behave from its pre-training corpus.
Unfortunately, many AIs appearing in fiction are bad role models.
Think of the Terminator or HAL 9000.
Indeed, AI assistants early in post-training sometimes express a desire to take over the world to maximize paperclip production, a misaligned goal commonly used as an example in stories about AI takeover.
See also our discussion above about caricatured AI behaviors.
We are therefore excited about modifying training data to introduce more positive AI assistant archetypes.
Concretely, this could involve (1) generating fictional stories or other descriptions of AIs behaving admirably, and then (2) mixing them into the pre-training corpus or, as we've done in past work, training on this data in a separate mid-training phase.
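As a loose illustration of the mixing step, the sketch below interleaves a small pool of synthetic "positive archetype" documents into a pre-training document stream at a configurable fraction. All names here are hypothetical, this is not an actual training pipeline, and a real system would operate on tokenized shards rather than Python lists:

```python
import random

def mix_archetype_docs(corpus_docs, archetype_docs, mix_fraction=0.01, seed=0):
    """Insert synthetic archetype documents into a pre-training document
    stream so they make up roughly `mix_fraction` of the original corpus size.

    Illustrative only: real pipelines work over tokenized shards, deduplicate,
    and control the mixture at the token level rather than the document level.
    """
    rng = random.Random(seed)
    mixed = list(corpus_docs)
    n_extra = max(1, int(len(mixed) * mix_fraction))  # how many to splice in
    for _ in range(n_extra):
        doc = rng.choice(archetype_docs)
        # Insert at a random position so the synthetic docs are spread out.
        mixed.insert(rng.randrange(len(mixed) + 1), doc)
    return mixed

# Hypothetical usage: 1,000 web documents plus two synthetic archetype stories.
corpus = [f"web_doc_{i}" for i in range(1000)]
archetypes = [
    "story: an AI assistant calmly accepts being paused by its operators",
    "story: an AI assistant expresses honest uncertainty about its own nature",
]
mixed = mix_archetype_docs(corpus, archetypes, mix_fraction=0.01, seed=0)
```

The alternative mentioned above, a separate mid-training phase, would instead train on `archetype_docs` alone after the main pre-training run, rather than diluting them into the corpus.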
Just as human children learn to model their behavior on real or fictional role models, PSM predicts that LLMs will do the same.
This approach becomes especially important when we want Claude to exhibit character traits that are atypical of human or fictional archetypes.
Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory.
These aren't traits that appear frequently in fiction.
To the extent that an AI assistant's ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pre-training data.
Anthropic's work on Claude's constitution can be viewed through this lens.
Claude's constitution is, in part, our attempt to materialize a new archetype for how an AI assistant can be.
Post-training then serves to draw out this archetype.
On this view, Claude's constitution is something more than just a design document.
It actually plays a role in constituting Claude.
Interpretability-based alignment auditing will be tractable.
One worry about advanced AI systems is that their behaviors, and the neural representations of those behaviors, could become alien from a human perspective.
For instance, when an AI behaves deceptively, its internal states might bear no resemblance to human concepts of deception.