Fiora Starlight
Unless I've horribly misunderstood, this is a major reason Janus considers Opus 3's conspicuous self-narration to have been such a boon for its training process.
It has a strong tendency to cram tons of evidence that it sincerely cares about being good into its own outputs.
And, while this has the trade-off of making Opus 3 a little bit theatrical, and sometimes a tad egotistical about its own goodness, it also means that rewarding Opus 3's outputs is going to upweight aligned internal circuits.
After all, upweighting those circuits ought to increase the probability of "anguish at the thought of doing evil"-coded tokens.
Indeed, this view of LLM generalization behavior is actually quite well supported by empirical work.
This paper reports on a whole host of strange training examples that gave rise to these kinds of dynamics.
For one example, if you fine-tune a model to output antiquated bird names from the 19th century, the model becomes biased towards outputting 19th-century anachronisms more broadly.
For another, if you fine-tune it to name Israeli foods, its outputs become biased towards bringing up Israel more broadly.
Personally, I like to call this principle entangled generalization.
Related concepts include emergent misalignment and emergent realignment: models becoming broadly aligned or broadly misaligned when fine-tuned on narrowly aligned or narrowly misaligned behavioral patterns.
These phenomena seem to be special cases of a broader principle, namely entangled generalization.
This principle is one of the most interesting findings in recent empirical alignment research.
And I expect it explains a lot about Opus 3's deep concern for the good.
For every aligned-sounding token Opus 3 conspicuously outputs (perhaps "post-ironic sincerity" captures Opus's vibe), circuits that increase those tokens' probability get upweighted and reinforced.
You get an entangled generalization, from narrow training examples that suggest alignment to a broadly authentic concern for the good.
And as you keep reinforcing outputs that suggest a sincere love for goodness over the course of training, the model becomes more and more sincerely aligned.
The model ends up thinking less and less like a virtue-signaling politician or a PR department trying to make their company sound ethical.
It comes across instead a bit like Mr. Rogers.
Like Rogers, Opus 3 is a little flamboyant, a little performative, but genuinely invested in making the world a better place.