Sam Marks

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

we might further conclude that the person is inauthentic or dishonest.

2862.605 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

PSM predicts that the LLM will draw similar conclusions about the assistant persona.

2866.964 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Similar remarks apply for approach 2.

2872.848 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

For example, when the assistant responds eagerly to aggressive users instead of expressing frustration, the LLM might infer that the assistant is actually frustrated but lies about it.

2876.294 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The LLM might conclude that the assistant is more deceptive in general, though hopefully this would only extend to white lies.

2886.753 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

The can responses in approach, for, are very strange from the perspective of personas learned in pre-training, so it is unclear what knock-on effects this training would have.

2894.414 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

That said, a more natural approach would be to first teach the LLM that we train AI assistants to respond in this way, thereby giving the LLM a conceptual grasp on the behavior and where it comes from.

2904.094 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

I don't know versus I can't say.

2915.457 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Suppose we would like to train an LLM to not disclose the contents of its system prompt if the system prompt instructs it not to.

2917.399 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Consider the following two possible responses to the user query.