Sam Marks
For understanding the behavior of AI assistants, it is the unfaithful actors that are most concerning, since faithful actors do not affect the assistant's behavior so long as they remain in character.
Authors and narratives.
On the actor view, another persona might distort the assistant's behavior in service of that persona's own goals.
A related but distinct concern is that the LLM simulates not just the assistant, but an overall story in which the assistant is a character, and that story might go in unwelcome directions.
Consider a novel about a helpful AI assistant with a concerning narrative arc.
For example, perhaps it is a story like Breaking Bad where the assistant is genuinely helpful at first before becoming corrupted.
Or perhaps the assistant is an unwitting sleeper agent who could be set off at any moment, like in The Manchurian Candidate.
One could view this situation as involving a narrative agency that affects the assistant's behavior.
Notably, this misaligned narrative isn't a fact about the psychology of the assistant.
The assistant does not plan or intend to become corrupted.
Rather, it's a fact about the psychology of an implicit author or about the narrative that the assistant is embedded in.
This latter case is especially interesting.
Unlike the author case, in the narrative case there is no longer a human-like persona whose psychology we can analyze.
On the other hand, even simulated narratives are persona-like in certain ways.
They are still anchored in the pre-training data distribution, so many of the same tools for understanding persona-based agency may also be available for understanding a narrative agency.
Figure 4. An overview of perspectives on PSM exhaustiveness.
By "is understanding the assistant sufficient," we mean: does understanding the assistant give a full account of AI assistant behavior?