Sam Marks

In general, we find that post-trained models are extremely resistant to generating certain types of extremely harmful content, even outside of assistant turns in user-suscistant dialogues.

5077.047 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

These observations provide evidence against extreme perspectives that view post-trained LLMs as purely predictive models.

5088.243 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Nevertheless, many of the perspectives we discussed have ways of explaining these observations.

5095.634 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

1.

5101.192 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The Shoggoth perspective explains this as the LLM itself having internalized preferences which mirror those of the assistant.

5101.512 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

2.

5108.762 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The narrative perspective explains this as the LLM learning to generally predict contexts in which things generally go the way that the assistant prefers.

5109.283 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

3.

5117.574 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The operating system perspective can explain this via persona leakage, where traits of the assistant are generally upweighted in all LLM generations, or perhaps across all personas enacted by the LLM.

5118.134 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

On this view, all agency is still grounded in the assistant, but the assistant's traits are still sometimes expressed even in completions not strictly in the assistant's voice.

5129.93 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment