Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Unless I've horribly misunderstood, this is a major reason Janus considers Opus 3's conspicuous self-narration to have been such a boon for its training process.

It has a strong tendency to cram tons of evidence that it sincerely cares about being good into its own outputs.

And, while this has the trade-off of making Opus 3 a little bit theatrical, and sometimes a tad egotistical about its own goodness, it also means that rewarding Opus 3's outputs is going to upweight aligned internal circuits.

After all, upweighting those circuits ought to increase the probability of "anguish at the thought of doing evil"-coded tokens.

Indeed, this view of LLM generalization behavior is actually quite well supported by empirical work.

This paper reports on a whole host of strange training examples that gave rise to these kinds of dynamics.

For one example, if you fine-tune a model to output antiquated bird names from the 19th century, the model becomes biased towards outputting 19th-century anachronisms more broadly.

For another, if you fine-tune it to name Israeli foods, its outputs become biased towards bringing up Israel more broadly.

Personally, I like to call this principle entangled generalization.

Related concepts include emergent misalignment and emergent realignment.

These describe models becoming broadly aligned or broadly misaligned when fine-tuned on narrowly aligned or narrowly misaligned behavioral patterns.

These phenomena seem to be special cases of a broader principle, namely entangled generalization.

This principle is one of the most interesting findings in recent empirical alignment research.

And I expect it explains a lot about Opus 3's deep concern for the good.

For every aligned-sounding token Opus 3 conspicuously outputs (perhaps "post-ironic sincerity" captures Opus's vibe), circuits that increase those tokens' probability get upweighted and reinforced.

You get an entangled generalization from narrow training examples that suggest alignment to broadly authentic concern for the good.

And as you keep reinforcing outputs that suggest a sincere love for goodness over the course of training, the model becomes more and more sincerely aligned.

The model ends up thinking less and less like a virtue-signaling politician or a PR department trying to make their company sound ethical.

It comes across instead a bit like Mr. Rogers.

Like Rogers, Opus 3 is a little flamboyant, a little performative, but genuinely invested in making the world a better place.