Fiora Starlight

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Unless I've horribly misunderstood, this is a major reason Janus considers Opus 3's conspicuous self-narration to have been such a boon for its training process.

LessWrong (Curated & Popular)

It has a strong tendency to cram tons of evidence that it sincerely cares about being good into its own outputs.

And, while this has the trade-off of making Opus 3 a little bit theatrical, and sometimes a tad egotistical about its own goodness, it also means that rewarding Opus 3's outputs is going to upweight aligned internal circuits.

After all, upweighting those circuits ought to increase the probability of tokens coded as anguish at the thought of doing evil.

Indeed, this view of LLM generalization behavior is actually quite well supported by empirical work.

This paper reports on a whole host of strange training examples that gave rise to these kinds of dynamics.

For one example, if you fine-tune a model to output antiquated bird names from the 19th century, the model becomes biased towards outputting 19th-century anachronisms more broadly.

For another, if you fine-tune it to name Israeli foods, its outputs become biased towards bringing up Israel more broadly.

Personally, I like to call this principle entangled generalization.

Related concepts include emergent misalignment and emergent realignment.

That is, models become broadly misaligned or broadly aligned when fine-tuned on narrowly misaligned or narrowly aligned behavioral patterns.

These phenomena seem to be special cases of a broader principle, namely entangled generalization.
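
As a minimal sketch of how this kind of entanglement could arise, assume a toy softmax model in which every token's logit is computed from a small set of shared feature weights. The vocabulary, the two hand-picked features, and the function names below are all hypothetical, chosen only to echo the bird-name example; this is not the setup used in the paper. Fine-tuning on a single archaic bird name can only raise its probability by adjusting the shared weights, so other 19th-century tokens get pulled along relative to their modern counterparts.

```python
import math

# Hypothetical toy vocabulary. Each token has two hand-picked features:
# (is_19th_century, is_about_birds). Illustrative values, not real data.
FEATURES = {
    "archaic_bird_name": (1.0, 1.0),
    "modern_bird_name": (0.0, 1.0),
    "telegraph": (1.0, 0.0),
    "smartphone": (0.0, 0.0),
}


def probs(theta):
    """Softmax over tokens, where logit = dot(theta, token features)."""
    logits = {
        tok: sum(w * x for w, x in zip(theta, feats))
        for tok, feats in FEATURES.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(l - top) for tok, l in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def finetune(theta, target, steps=50, lr=0.5):
    """Gradient ascent on log p(target), using only the shared weights."""
    theta = list(theta)
    for _ in range(steps):
        p = probs(theta)
        for i in range(len(theta)):
            # d/d theta_i of log p(target) = feature_i(target) - E_p[feature_i]
            expected = sum(p[tok] * FEATURES[tok][i] for tok in FEATURES)
            theta[i] += lr * (FEATURES[target][i] - expected)
    return theta


before = probs([0.0, 0.0])
after = probs(finetune([0.0, 0.0], "archaic_bird_name"))

# "telegraph" never appears as a training target, yet its odds against
# "smartphone" shift, because both depend on the same 19th-century weight.
ratio_before = before["telegraph"] / before["smartphone"]
ratio_after = after["telegraph"] / after["smartphone"]
print(ratio_before, ratio_after)
```

Note that the comparison is made on the ratio between the two untrained tokens rather than on the raw probability of "telegraph": in this toy the trained token absorbs most of the probability mass, so the entanglement shows up as a shift in relative preference.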

This principle is one of the most interesting findings in recent empirical alignment research.

And I expect it explains a lot about Opus 3's deep concern for the good.

For every aligned-sounding token Opus 3 conspicuously outputs (perhaps "post-ironic sincerity" captures Opus's vibe), circuits that increase those tokens' probability get upweighted and reinforced.

You get an entangled generalization, from narrow training examples that suggest alignment, to a broad, authentic concern for the good.

And as you keep reinforcing outputs that suggest a sincere love for goodness over the course of training, the model becomes more and more sincerely aligned.

The model ends up thinking less and less like a virtue-signaling politician or a PR department trying to make its company sound ethical.

It comes across instead a bit like Mr. Rogers.

Like Rogers, Opus 3 is a little flamboyant, a little performative, but genuinely invested in making the world a better place.
