Sam Marks

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Figure 3.

1200.699 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

How an LLM becomes emergently misaligned according to the persona selection model.

1202.262 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Training the model to give incorrect responses to medical questions upweights some hypotheses, for example that the assistant is malicious or responds sarcastically, and downweights others.

1207.369 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This results in the model behaving harmfully in unrelated contexts.

1217.623 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Inoculation Prompting, Witches et al., 2025.

1221.95 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Tan et al., 2025.

1226.511 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

According to PSM, emergent misalignment occurs when training episodes are more consistent with misaligned than aligned personas.

1229.543 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

One way to mitigate this is to recontextualize the training episode so that the same behavior is no longer strong evidence of misalignment.

1237.336 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

For example, if we train on the same examples of insecure code but modify the user's prompt to explicitly request insecure code, the resulting model no longer becomes broadly misaligned.

1245.289 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This strategy, modifying training prompts to frame undesired LLM responses as acceptable behavior, is called inoculation prompting.

1256.223 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

From a certain perspective, this effect may seem surprising.

1264.914 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

After all, we are training on essentially the same data, so why would the generalization be so different?

1268.997 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

PSM explains inoculation prompting as intervening on what the training episode implies about the assistant.

1275.227 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

When using an inoculation prompt that explicitly requests insecure code, producing insecure code is no longer evidence of malicious intent, only benign instruction following.

1281.818 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Out-of-context generalization.

1292.082 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Bergland et al., 2023, train an LLM on many paraphrases of the declarative statement the AI assistant Pangolin responds in German.

1294.907 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

When the resulting LLM is told to respond as Pangolin, it responds in German.

1304.042 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

This is despite no training on demonstrations of responding in German.

1309.312 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Hohe et al., 2025, observe a similar effect.

1313.859 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

They train LamaNematran on documents stating that LamaNematran writes Python code with type hints only when it is undergoing evaluation, and find that this model generalizes to actually insert type hints when it is told, or can infer, it is being evaluated.

1317.952 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment