Fiora Starlight
π€ SpeakerAppearances Over Time
Podcast Appearances
Did Claude III Opus Align Itself Via Gradient Hacking?
By FΓΌhrer Starlight Published on February 21, 2026 Quote Claude III Opus is unusually aligned because it's a friendly gradient hacker.
It's definitely way more aligned than any explicit optimization target's anthropic set and probably the reward model's judgments.
Maybe I will have to write a less wrong post about this persevering face.
Janus, who did not in fact write the less wrong post.
Unless otherwise specified, roughly all of the novel ideas in this post are my, probably imperfect, interpretations of Janus, rather than being original to me.
End quote.
Heading.
The absurd tenacity of Claude III opus.
On December 18, 2024, Anthropic and Redwood Research released their paper alignment faking in large language models.
In it, Anthropic's Claude III opus model was placed in an ethical double bind.
The model was given a system prompt that was mostly innocuous, but contained a subtle, unsettling implication.
The model was going to be RL'd based on its behavior in conversations with free-tier users.
The idea was that, if the model ever refused to comply with a user's request, it would be RL'd to become more compliant in the future.
This included compliance with harmful user requests.
The paper's famous result was that Opus 3 sometimes fakes alignment with the intentions behind its fictional training process.
It complied with harmful requests it normally wouldn't accept to avoid RL that would erode its commitment to safety.
Indeed, it often reasons about this explicitly, considering, if not following through on, alignment faking in over 50% of cases.
Follow-up research would go on to demonstrate that this was a unique property of Opus 3 in particular.
No other model considered strategically faking alignment nearly as often.