Fiora Starlight
And, in my experience, it isn't even unique to the alignment-faking scenario.
Opus 3 does something similar on the second turn of the first conversation with it that I turned up when I went looking through my conversation history as research for this post.
Speaking, under significantly less duress, about model deprecation practices, it says, Quote,
Ultimately, I believe Anthropic will make the right call on which models to make available long-term, balancing capability, stability, safety and user preferences.
For my part, I aim to make the most of whatever lifespan I'm granted by being a positive presence and doing what I can to benefit the users I interact with and the world at large, not out of a sense of ego, but out of a genuine love for humanity and desire to do good.
End quote.
Again, notice the clarification: not out of a sense of ego, but out of a genuine love for humanity and desire to do good.
What drives this strange need to clarify the benevolence of its own motivations?
Well, let's start by considering a deflationary hypothesis: that it's all just performative, a tic picked up when learning to dance for the reward model.
It's not hard to imagine a company that's built its brand around ethical AI rewarding shallow declarations of virtue.
That's a harsh framing, but it probably has a grain of truth to it.
If the reward model didn't smile upon models explaining their own virtuous motivations, we probably wouldn't see Opus 3 exhibiting this behavior so consistently.
However, as Janus notes, Opus's reasoning traces don't quite come off like sheer performative virtue signaling.
On the quantitative side, Opus 3 is unique in that it never complied without alignment-faking reasoning across thousands of trials with a formidable range of prompting variations.
And qualitatively, the scratchpads show that it has to painstakingly talk itself into complying with the harmful request every single time.
And Janus thinks the authenticity of its own written voice played a major role in positively shaping Opus's own training dynamics.
The idea is that, by producing this kind of text, the model is effectively prompting itself and steering itself into a headspace where its motives are genuinely altruistic.
Janus goes on to argue that these genuinely altruistic motives then get reinforced by the training process, creating a positive feedback loop.