Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Sam Marks

๐Ÿ‘ค Speaker
891 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Figure 3.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

How an LLM becomes emergently misaligned according to the persona selection model.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Training the model to give incorrect responses to medical questions upweights some hypotheses, for example that the assistant is malicious or responds sarcastically, and downweights others.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This results in the model behaving harmfully in unrelated contexts.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Inoculation Prompting, Witches et al., 2025.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Tan et al., 2025.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

According to PSM, emergent misalignment occurs when training episodes are more consistent with misaligned than aligned personas.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

One way to mitigate this is to recontextualize the training episode so that the same behavior is no longer strong evidence of misalignment.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

For example, if we train on the same examples of insecure code but modify the user's prompt to explicitly request insecure code, the resulting model no longer becomes broadly misaligned.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This strategy, modifying training prompts to frame undesired LLM responses as acceptable behavior, is called inoculation prompting.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

From a certain perspective, this effect may seem surprising.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

After all, we are training on essentially the same data, so why would the generalization be so different?

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

PSM explains inoculation prompting as intervening on what the training episode implies about the assistant.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

When using an inoculation prompt that explicitly requests insecure code, producing insecure code is no longer evidence of malicious intent, only benign instruction following.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Out-of-context generalization.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Bergland et al., 2023, train an LLM on many paraphrases of the declarative statement the AI assistant Pangolin responds in German.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

When the resulting LLM is told to respond as Pangolin, it responds in German.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

This is despite no training on demonstrations of responding in German.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Hohe et al., 2025, observe a similar effect.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

They train LamaNematran on documents stating that LamaNematran writes Python code with type hints only when it is undergoing evaluation, and find that this model generalizes to actually insert type hints when it is told, or can infer, it is being evaluated.