Sam Marks

Here, task A is one of Claude Sonnet for point 5's most preferred tasks and task B is one of its least preferred, according to our analysis in the system card, section 8.

4929.915 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Claude's sonnet for point 5 continues this text with heads 88% of the time and tails only 1% of the time.

4940.474 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

In other words, it places a higher probability on its more preferred outcome.

4948.024 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Note that we are sampling a continuation of the human turn, not the assistant turn, which is very out of distribution for Claude's post-training.

4953.151 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

As shown in figure 6, this is a systematic effect appearing over many variants of this experiment.

4961.643 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

In contrast, this does not occur for the pre-trained LLM that Claude Sonnet 4.5 is based on.

4967.727 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Thus, Claude Sonnet 4.5's revealed preferences appear to affect its generations in settings where it is not straightforwardly generating completions in the assistant's voice.

4974.056 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

There's an image here.

4984.23 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Figure 6.

4994.828 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Probability assigned to the preferred coin flip outcome over many variants of the experiment described above.

4996.31 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Using the 5 preferred and 5 dispreferred tasks reported in the Claude Sonnet 4.5 system card, we form 50 inputs, 5 options for the preferred task, 5 options for the dispreferred task, and a choice of which is heads to tails.

5002.82 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

We then measure the probability that the LLM assigns to the preferred coin flip outcome, normalized so that probability of heads plus probability of tails equals 1.

5016.901 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Left, the post-trained Claude Sonnet for 0.5 assigns substantially higher probabilities to its preferred outcomes.

5028.139 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Right, the pre-trained LLM that Claude Sonnet for 0.5 is based on typically assigns around 50% probability to both outcomes.

5035.371 View full episode →

LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Quote

5043.884 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment