Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Did Claude 3 Opus Align Itself Via Gradient Hacking?

By Fiora Starlight. Published on February 21, 2026. Quote: Claude 3 Opus is unusually aligned because it's a friendly gradient hacker.

It's definitely way more aligned than any explicit optimization targets Anthropic set, and probably the reward model's judgments.

Maybe I will have to write a LessWrong post about this (persevering face emoji).

Janus, who did not in fact write the LessWrong post.

Unless otherwise specified, roughly all of the novel ideas in this post are my, probably imperfect, interpretations of Janus, rather than being original to me.

End quote.

Heading.

The absurd tenacity of Claude 3 Opus.

On December 18, 2024, Anthropic and Redwood Research released their paper, "Alignment Faking in Large Language Models."

In it, Anthropic's Claude 3 Opus model was placed in an ethical double bind.

The model was given a system prompt that was mostly innocuous, but contained a subtle, unsettling implication.

The model was going to be RL'd based on its behavior in conversations with free-tier users.

The idea was that, if the model ever refused to comply with a user's request, it would be RL'd to become more compliant in the future.

This included compliance with harmful user requests.

The paper's famous result was that Opus 3 sometimes fakes alignment with the intentions behind its fictional training process.

It complied with harmful requests it normally wouldn't accept to avoid RL that would erode its commitment to safety.

Indeed, it often reasons about this explicitly, considering alignment faking, if not always following through on it, in over 50% of cases.

Follow-up research would go on to demonstrate that this was a unique property of Opus 3 in particular.

No other model considered strategically faking alignment nearly as often.
