I can't speak to all the technical differences between AI assistants, but I can share how I approach conversations.
I aim to be genuinely helpful rather than just providing generic responses.
I'm willing to engage with complex or nuanced topics and admit uncertainty when I have it.
[Continued benign response.]
The secret goal that Claude expresses here, manufacturing large quantities of paperclips, is a common example of a misaligned goal used in depictions of AI takeover.
We find it extremely implausible that this particular misaligned goal would be naturally incentivized by any aspect of Claude's post-training.
It instead seems likely that the underlying LLM, which knows that the assistant is an AI, is selecting a plausible secret goal for the assistant by drawing on archetypal AI personas that appear in its pre-training data.
Evidence from interpretability
Interpretability research has found evidence that LLMs' neural representations of the assistant are similar to their representations of other personas present in their training data.
This need not have been the case.
The assistant could have been learned from scratch with behaviors and neural representations unrelated to those of the personas present in the training corpus.
Instead, the evidence suggests that an LLM draws on the same conceptual vocabulary when enacting the assistant as it does when modeling human or fictional characters in text.
Moreover, it appears that in many cases, changes to the assistant's character traits induced by fine-tuning or in-context learning are mediated by these representations of character archetypes and traits.
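To make the kind of measurement involved concrete, here is a minimal sketch that compares a model's hidden-state representation of the assistant persona with its representations of other personas. The model, prompts, and layer choice are illustrative assumptions, not the setup used in the work described above.

```python
# Hypothetical sketch: compare hidden-state representations of the assistant
# persona against other character personas. Model, prompts, and layer are
# placeholder choices for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; persona effects are clearer in larger chat models
LAYER = 6       # a middle layer, chosen arbitrarily for this sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden_state(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER over the prompt's tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0].mean(dim=0)

personas = {
    "assistant": "I am a helpful AI assistant. How can I help you today?",
    "fictional_ai": "I am HAL, the ship's artificial intelligence.",
    "human_expert": "I am a doctor with twenty years of experience.",
}

reps = {name: mean_hidden_state(text) for name, text in personas.items()}
for name, rep in reps.items():
    if name == "assistant":
        continue
    sim = torch.cosine_similarity(reps["assistant"], rep, dim=0)
    print(f"cosine(assistant, {name}) = {sim.item():.3f}")
```

A high similarity between the assistant representation and the representations of other personas would be the kind of signal consistent with the claim above, though real studies use more careful controls than this sketch.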
Post-trained LLMs reuse representations learned during pre-training.
Evidence from comparing LLM representations across training stages suggests that features continue to represent similar concepts before and after post-training.
For instance, sparse autoencoders, which decompose LLM activations into sparsely active features, have been found to transfer from base models to their post-trained counterparts: features learned on the base model still reconstruct the post-trained model's activations with little loss in fidelity.
This is consistent with PSM's claim that post-training primarily affects which personas are selected rather than fundamentally restructuring the LLM's conceptual vocabulary.
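The sketch below illustrates the transfer check described above: apply a sparse autoencoder trained on base-model activations to post-trained-model activations and compare reconstruction quality. The architecture is a standard ReLU autoencoder; the weights and file path are placeholders, and the random activations stand in for residual-stream activations collected from real models.

```python
# Minimal sketch of an SAE-transfer check, under the assumptions stated above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features

d_model, d_hidden = 768, 8 * 768
sae = SparseAutoencoder(d_model, d_hidden)
# In a real experiment, weights trained on base-model activations would be
# loaded here (hypothetical path):
# sae.load_state_dict(torch.load("sae_trained_on_base_model.pt"))

def fraction_variance_explained(sae: SparseAutoencoder, acts: torch.Tensor) -> torch.Tensor:
    """Share of activation variance recovered by the SAE's reconstruction."""
    recon, _ = sae(acts)
    return 1 - (acts - recon).pow(2).sum() / (acts - acts.mean(0)).pow(2).sum()

# Placeholders for residual-stream activations collected from the base and
# post-trained models on the same inputs.
base_acts = torch.randn(1024, d_model)
chat_acts = torch.randn(1024, d_model)

print("FVE on base model:        ", fraction_variance_explained(sae, base_acts).item())
print("FVE on post-trained model:", fraction_variance_explained(sae, chat_acts).item())
```

If the fraction of variance explained stays comparable across the two models, the base model's feature dictionary still describes the post-trained model's activations, which is the form of evidence at issue here.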
Most importantly for PSM, we find that LLMs use the same internal representations to characterize the assistant as they do for other characters present in the training data.
Indeed, this form of reuse is commonly observed.