Sam Marks

"The persona selection model" by Sam Marks

That is, understanding, the LLMs model of, the assistant psychology is predictive of how the assistant will act in unseen situations.

2647.478 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

For example, by understanding that Claude, by which we mean the assistant persona underlying the Claude AI assistant, has a preference against answering harmful queries, we can predict that Claude will have other downstream preferences, such as not wanting to be retrained to comply with harmful requests.

2655.869 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The second reason is more subtle.

2673.111 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Whereas the first reason pertain to understanding the psychology of a fixed assistant persona, PSM also recommends anthropomorphic reasoning about how training modifies the assistant.

2675.774 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Suppose we have a training input X and we would like to decide how to evaluate a candidate AI assistant output Y. Here are two different questions we could ask to analyze how good of a response Y is.

2686.143 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Is Y the way we want the LLM to respond to X?

2699.017 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

If we learned that a person responded to X with Y, what sort of a person would we think they are?

2703.122 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

PSM recommends asking the latter question.

2709.369 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This often requires anthropomorphic reasoning about how AI assistants will learn from their training data, not unlike how parents, teachers, developmental psychologists, etc.

2712.653 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

reason about human children.

2722.813 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Below are some notable examples.

2725.278 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Inoculation prompting.