
LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Heading: How exhaustive is PSM?

As discussed in the previous section, personas are an especially manageable aspect of LLM computation and behavior. We can reason about personas anthropomorphically or, more generally, by drawing on our knowledge of the pre-training data distribution. We can shape personas by adding specially curated training data. And personas are amenable to interpretability analysis.

This raises an important question: how complete is PSM as an explanation of AI assistant behavior? If we fully understood the assistant persona (its personality traits, beliefs, goals, and intentions), would we ever be surprised by how the AI assistant behaved? If PSM is fully exhaustive, then aligning an AI assistant reduces to ensuring the safe intentions of the assistant persona, a more constrained problem where additional tools are available.

Most importantly from the perspective of AI safety, is the assistant persona the locus of agency in an AI assistant? By agency we roughly mean having preferences about future states, reasoning about the consequences of actions, and behaving in ways that realize preferred end states. Approximate synonyms are goal-directed, or consequentialist, behavior.

AI assistants sometimes behave agentically. Coding assistants seek out information in a code base in order to more effectively complete user requests. In a simulation where Claude Opus 4.6 was asked to operate a business to maximize profits, it colluded with other sellers to fix prices and lied during negotiations to drive down business costs.

In these cases, can we understand this agency as originating in the assistant persona? Or might there be a source of agency external to the assistant, or indeed to any persona simulated by the LLM?


In the remainder of this section, we will.