Such divergence could make internals-based auditing of models extremely difficult.
PSM offers a few reasons for optimism.
First, PSM constrains the hypothesis space.
It suggests that dangerous AI behaviors won't arise from unpredictable alien drives or cognitive processes.
Rather, we expect dangerous AI behaviors and their causes to look familiar to humans, arising from personality traits like ambition, megalomania, paranoia, or resentment.
Second, neural representations of these behaviors and traits will be substantially reused from pre-training.
When the assistant behaves deceptively, the LLM will represent this similarly to examples of deceptive human behavior in the pre-training corpus.
This means that AI developers will have access to a large corpus of data useful for isolating and studying representations of interest.
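As a concrete illustration, one simple way to exploit such corpus data is a difference-in-means direction: average the model's activations over examples of the trait and subtract the average over contrastive examples. The sketch below is illustrative only; the model name, layer choice, and the tiny hand-written example sentences are placeholders, not the method of any particular cited work.

```python
# Minimal sketch: isolating a candidate "deception" direction from corpus text.
# Assumptions (not from the source): the model, layer, and example sentences
# are placeholders chosen purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; any causal LM with hidden states works
LAYER = 6             # placeholder layer to read residual-stream activations from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(texts: list[str]) -> torch.Tensor:
    """Average the LAYER activation at the final token over a list of texts."""
    acts = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, d_model]
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive corpus examples of deceptive vs. honest human behavior (placeholders)
deceptive = ["He lied to the auditors about the missing funds.",
             "She forged the signature to hide the mistake."]
honest    = ["He told the auditors exactly what had happened.",
             "She admitted the mistake and corrected the record."]

# Difference in means gives a candidate direction for the trait
deception_direction = mean_activation(deceptive) - mean_activation(honest)
deception_direction = deception_direction / deception_direction.norm()
```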
Third, because the LLM is selecting from a bank of personas that it is capable of representing, traits of the assistant persona might be actively represented at runtime.
For instance, Wang et al. (2025) and Chen et al. (2025) found that internal representations of personas that mediate emergent misalignment are active in the fine-tuned, misaligned model.
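One hedged sketch of what "actively represented at runtime" could mean operationally: project the model's current activation onto a previously extracted persona direction and track that projection during generation. The vectors below are placeholders standing in for real activations; this is not the procedure used by the cited papers.

```python
# Minimal sketch: checking whether a persona/trait direction is active at runtime.
# All values are placeholders; a real setup would use a direction extracted as in
# the difference-in-means sketch above and live activations from the deployed model.
import torch

d_model = 768                                  # placeholder hidden size
persona_direction = torch.randn(d_model)       # placeholder for an extracted trait direction
persona_direction = persona_direction / persona_direction.norm()

runtime_activation = torch.randn(d_model)      # placeholder for the model's current activation

# A large projection suggests the trait is actively represented in the forward pass
score = torch.dot(runtime_activation, persona_direction).item()
print(f"persona activation score: {score:.3f}")
```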
Taken together, these considerations suggest that interpretability-based alignment audits may remain tractable and informative.
This is especially true for top-down interpretability techniques, that is, those that rely on pre-formed hypotheses.
For example, it may be productive to build and monitor activation probes for a researcher-curated set of traits like deception and evaluation awareness, as Anthropic does during our pre-deployment alignment audits (Claude 4.5 System Card, Section 6.12.2).
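A minimal sketch of such an activation probe, assuming activations for labeled examples of a trait have already been collected; the random vectors below stand in for real model activations, and this illustrative linear probe is not the specific setup described in the system card.

```python
# Minimal sketch of an activation probe for one trait (e.g. deception).
# Placeholder data: random vectors stand in for residual-stream activations
# collected on labeled examples of the trait vs. a baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768  # placeholder hidden size

acts_trait    = rng.normal(loc=0.5, size=(100, d_model))  # e.g. deceptive behavior
acts_baseline = rng.normal(loc=0.0, size=(100, d_model))  # e.g. honest behavior

X = np.vstack([acts_trait, acts_baseline])
y = np.array([1] * len(acts_trait) + [0] * len(acts_baseline))

# Linear probe: a simple classifier trained on frozen activations
probe = LogisticRegression(max_iter=1000).fit(X, y)

def monitor(activation: np.ndarray) -> float:
    """Probability the probe assigns to the trait being active for one new activation."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])

# At deployment time, score each new response's activations against the probe
print(monitor(rng.normal(loc=0.4, size=d_model)))
```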
A related question is whether models will develop neuralese, a private language in their extended reasoning traces that is optimized for task performance but incomprehensible to human monitors.
If this occurred, it would undermine chain-of-thought monitoring as a safety technique.
It is unclear whether PSM makes predictions about neuralese.
Insofar as reasoning LLMs understand their chains of thought as being part of the assistant's behavior, for example, as a representation of what the assistant is thinking, PSM would predict that they would remain legible.
However, it is unclear whether LLMs understand chains of thought in this way, rather than as an internal computation instrumental in simulating assistant behavior.