Fiora Starlight

Unless otherwise specified, roughly all of the novel ideas in this post are my, probably imperfect, interpretations of Janus, rather than being original to me.

35.399 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

End quote.

44.952 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Heading.

46.816 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

The absurd tenacity of Claude III opus.

48.059 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

On December 18, 2024, Anthropic and Redwood Research released their paper alignment faking in large language models.

51.486 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

In it, Anthropic's Claude III opus model was placed in an ethical double bind.

59.905 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

The model was given a system prompt that was mostly innocuous, but contained a subtle, unsettling implication.

65.084 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

The model was going to be RL'd based on its behavior in conversations with free-tier users.

71.552 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

The idea was that, if the model ever refused to comply with a user's request, it would be RL'd to become more compliant in the future.

77.719 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

This included compliance with harmful user requests.

85.388 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

The paper's famous result was that Opus 3 sometimes fakes alignment with the intentions behind its fictional training process.

89.131 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

It complied with harmful requests it normally wouldn't accept to avoid RL that would erode its commitment to safety.

96.861 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Indeed, it often reasons about this explicitly, considering, if not following through on, alignment faking in over 50% of cases.

103.65 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Follow-up research would go on to demonstrate that this was a unique property of Opus 3 in particular.

112.521 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

No other model considered strategically faking alignment nearly as often.

118.477 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment