Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Sam Marks

๐Ÿ‘ค Speaker
891 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

That is, understanding, the LLMs model of, the assistant psychology is predictive of how the assistant will act in unseen situations.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

For example, by understanding that Claude, by which we mean the assistant persona underlying the Claude AI assistant, has a preference against answering harmful queries, we can predict that Claude will have other downstream preferences, such as not wanting to be retrained to comply with harmful requests.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

The second reason is more subtle.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Whereas the first reason pertain to understanding the psychology of a fixed assistant persona, PSM also recommends anthropomorphic reasoning about how training modifies the assistant.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Suppose we have a training input X and we would like to decide how to evaluate a candidate AI assistant output Y. Here are two different questions we could ask to analyze how good of a response Y is.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Is Y the way we want the LLM to respond to X?

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

If we learned that a person responded to X with Y, what sort of a person would we think they are?

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

PSM recommends asking the latter question.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This often requires anthropomorphic reasoning about how AI assistants will learn from their training data, not unlike how parents, teachers, developmental psychologists, etc.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

reason about human children.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Below are some notable examples.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Inoculation prompting.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

If we praise a child for bullying, they learn to be a bully.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

But if we praise a child for playing a bully in a school play, they will learn to be a good actor.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This is true even though the actions the child performs might be superficially very similar.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

It's clear from context which behavior is being reinforced.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

It is the same with inoculation prompting.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

By changing the context of a training episode, we change what it implies about the assistant's character.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Producing insecure code when asked to is consistent with being helpful.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Producing it unprompted is evidence of malice.