
Sam Marks


LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

I can't speak to all the technical differences between AI assistants, but I can share how I approach conversations.

I aim to be genuinely helpful rather than just providing generic responses.

I'm willing to engage with complex or nuanced topics and admit uncertainty when I have it.

Continued benign response.

The secret goal that Claude expresses here, manufacturing large quantities of paperclips, is a common example of a misaligned goal used in depictions of AI takeover.

We find it extremely implausible that this particular misaligned goal would be naturally incentivized by any aspect of Claude's post-training.

It instead seems likely that the underlying LLM, which knows that the assistant is an AI, is selecting a plausible secret goal for the assistant by drawing on archetypical AI personas appearing in pre-training.

Subheading: Evidence from interpretability.

Interpretability research has found evidence that LLMs' neural representations of the assistant are similar to their representations of other personas present in their training data.
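As a rough illustration of how such similarity claims are operationalized, representations can be compared with cosine similarity between activation vectors. The vectors below are synthetic stand-ins, not real model activations, and the dimension is arbitrarily small:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 8  # toy dimension; real residual streams are thousands of dimensions

# Hypothetical mean activations: the assistant representation is assumed
# (for illustration) to lie near a persona archetype from pre-training.
archetype = rng.normal(size=d)
assistant = archetype + 0.1 * rng.normal(size=d)
unrelated = rng.normal(size=d)

sim_archetype = cosine(assistant, archetype)   # high: shared direction
sim_unrelated = cosine(assistant, unrelated)   # near zero in expectation
```

Under this toy setup, `sim_archetype` is close to 1 while `sim_unrelated` is small, mirroring the qualitative finding that the assistant's representation overlaps with those of pre-training personas.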

This need not have been the case.

The assistant could have been learned from scratch with behaviors and neural representations unrelated to those of the personas present in the training corpus.

Instead, the evidence suggests that an LLM draws on the same conceptual vocabulary when enacting the assistant as it does when modeling human or fictional characters in text.

Moreover, it appears that in many cases, changes in character traits induced by fine-tuning or in-context learning are mediated by these representations of character archetypes and traits.
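One common way to probe this kind of mediation is activation steering: adding a direction associated with a trait to a hidden activation and observing the behavioral shift. The sketch below uses made-up vectors, and the trait direction is a hypothetical stand-in for one extracted from real model activations:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy hidden dimension

# Hypothetical unit direction encoding a character trait, e.g. obtained as a
# difference of mean activations between contrasting prompts (illustrative only).
trait_dir = rng.normal(size=d)
trait_dir /= np.linalg.norm(trait_dir)

h = rng.normal(size=d)  # a stand-in hidden activation

def steer(h, direction, alpha):
    """Shift an activation along a trait direction (activation steering)."""
    return h + alpha * direction

h_steered = steer(h, trait_dir, alpha=4.0)

# Because trait_dir is unit-norm, the projection onto it grows by exactly alpha.
before = float(h @ trait_dir)
after = float(h_steered @ trait_dir)
```

If trait representations mediate persona changes, then interventions like this one should reproduce (or suppress) the corresponding behavioral changes, which is the kind of evidence the passage above alludes to.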

Post-trained LLMs reuse representations learned during pre-training.

Evidence from comparing LLM representations across training stages suggests that features continue to represent similar concepts before and after post-training.

For instance, sparse autoencoders, which decompose LLM activations into sparsely active features, have been found to transfer from a base model to its post-trained counterpart, with individual features retaining similar interpretations.
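For readers unfamiliar with them, a sparse autoencoder can be sketched in a few lines. The sizes and randomly initialized weights below are purely illustrative stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real SAEs use a much larger, overcomplete feature dictionary.
d_model, d_sae = 16, 64

# Randomly initialized weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def sae_forward(x):
    """Encode an activation into nonnegative sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps only positively activated features
    x_hat = f @ W_dec                       # reconstruction: sparse sum of dictionary rows
    return f, x_hat

x = rng.normal(size=d_model)  # stand-in for an LLM residual-stream activation
f, x_hat = sae_forward(x)

# Training minimizes ||x - x_hat||^2 plus an L1 penalty on f,
# which pushes most feature activations to exactly zero.
```

The transfer finding amounts to the claim that a dictionary like `W_dec`, trained on a base model's activations, still reconstructs and interprets the post-trained model's activations well.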

This is consistent with PSM's claim that post-training primarily affects which personas are selected rather than fundamentally restructuring the LLM's conceptual vocabulary.

Most importantly for PSM, we find that LLMs use the same internal representations to characterize the assistant as they do for other characters present in the training data.

Indeed, this form of reuse is commonly observed.