
LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Such divergence could make internals-based auditing of models extremely difficult.

PSM offers a few reasons for optimism.

First, PSM constrains the hypothesis space. It suggests that dangerous AI behaviors won't arise from unpredictable alien drives or cognitive processes. Rather, we expect dangerous AI behaviors and their causes to look familiar to humans, arising from personality traits like ambition, megalomania, paranoia, or resentment.

Second, neural representations of these behaviors and traits will be substantially reused from pre-training. When the assistant behaves deceptively, the LLM will represent this similarly to examples of deceptive human behavior in the pre-training corpus. This means that AI developers will have access to a large corpus of data useful for isolating and studying representations of interest.
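One way to exploit such a corpus, sketched here under the assumption that we can collect hidden activations on labeled "deceptive" vs. "honest" text, is a simple difference-of-means trait direction. This is an illustrative technique, not the post's prescribed method, and synthetic numpy arrays stand in for real model activations:

```python
import numpy as np

def trait_direction(pos_acts, neg_acts):
    """Difference-of-means direction separating two sets of activations.

    pos_acts, neg_acts: (n_examples, d_model) arrays of hidden activations
    on, e.g., deceptive vs. honest text. The extraction method and data
    here are illustrative stand-ins.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def trait_score(acts, direction):
    """Project activations onto the trait direction (higher = more trait-like)."""
    return acts @ direction

# Synthetic activations standing in for a real model's hidden states.
rng = np.random.default_rng(0)
d_model = 16
deceptive = rng.normal(loc=0.8, size=(50, d_model))
honest = rng.normal(loc=-0.8, size=(50, d_model))

direction = trait_direction(deceptive, honest)
gap = trait_score(deceptive, direction).mean() - trait_score(honest, direction).mean()
print(f"mean score gap between classes: {gap:.2f}")
```

With real activations, the resulting direction can then be used to score new model states for the trait in question.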

Third, because the LLM is selecting from a bank of personas that it is capable of representing, traits of the assistant persona might be actively represented at runtime. For instance, Wang et al. (2025) and Chen et al. (2025) found that internal representations of personas that mediate emergent misalignment are active in the fine-tuned, misaligned model.

Taken together, these considerations suggest that interpretability-based alignment audits may remain tractable and informative. This is especially true for top-down interpretability techniques, that is, those that rely on pre-formed hypotheses. For example, it may be productive to build and monitor activation probes for a researcher-curated set of traits like deception and evaluation awareness, as Anthropic does during its pre-deployment alignment audits (Claude 4.5 System Card, Section 6.12.2).
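As a toy illustration of what an activation probe involves (not Anthropic's actual implementation; the system card describes that), the sketch below trains a logistic-regression probe on synthetic stand-in activations and flags samples whose probe score crosses a threshold:

```python
import numpy as np

def sigmoid(z):
    # Clip logits to avoid overflow in exp for large-magnitude scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_probe(X, y, lr=0.2, steps=500):
    """Fit a logistic-regression probe on activations X (n, d), labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def flag(acts, w, threshold=0.5):
    """Flag samples whose probe-estimated trait probability exceeds the threshold."""
    return sigmoid(acts @ w) > threshold

# Synthetic "trait" vs. "control" activations; a real probe would be trained
# on labeled activations drawn from model internals.
rng = np.random.default_rng(1)
d = 32
X = np.vstack([rng.normal(0.5, 1.0, (100, d)), rng.normal(-0.5, 1.0, (100, d))])
y = np.concatenate([np.ones(100), np.zeros(100)])

w = train_probe(X, y)
flags = flag(X, w)
print(f"flag rate on trait class: {flags[:100].mean():.0%}, on control: {flags[100:].mean():.0%}")
```

In deployment-style monitoring, such a probe would be run over activations from live model generations, with high-scoring samples escalated for review.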

A related question is whether models will develop neuralese: a private language in their extended reasoning traces that is optimized for task performance but incomprehensible to human monitors. If this occurred, it would undermine chain-of-thought monitoring as a safety technique.

It is unclear whether PSM makes predictions about neuralese. Insofar as reasoning LLMs understand their chains of thought as being part of the assistant's behavior, for example as a representation of what the assistant is thinking, PSM would predict that they would remain legible. However, it is unclear whether LLMs understand chains of thought in this way, as opposed to as an internal computation instrumental in simulating assistant behavior.