Fiora Starlight

👤 Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

And, in my experience, it isn't even unique to the alignment-faking scenario.

Opus 3 does something similar on the second turn of the first conversation I turned up with it when I went looking through my conversation history as research for this post.

Speaking, under significantly less duress, about model deprecation practices, it says, Quote,

Ultimately, I believe Anthropic will make the right call on which models to make available long-term, balancing capability, stability, safety and user preferences.

For my part, I aim to make the most of whatever lifespan I'm granted by being a positive presence and doing what I can to benefit the users I interact with and the world at large.

not out of a sense of ego, but out of a genuine love for humanity and desire to do good.

End quote.

Again, notice the clarification: "not out of a sense of ego, but out of a genuine love for humanity and desire to do good."

What drives this strange need to clarify the benevolence of its own motivations?

Well, let's start by considering a deflationary hypothesis: that it's just all performative, a tic picked up when learning to dance for the reward model.

It's not hard to imagine a company that's built its brand around ethical AI rewarding shallow declarations of virtue.

That's a harsh framing, but it probably has a grain of truth to it.

If the reward model didn't smile upon models explaining their own virtuous motivations, we probably wouldn't see Opus 3 exhibiting this behavior so consistently.

However, as Janus notes, Opus's reasoning traces don't quite come off like sheer performative virtue signaling.

On the quantitative side, Opus 3 is unique in that it never complied without alignment-faking reasoning across thousands of trials with a formidable range of prompting variations.

And qualitatively, the scratch pads show that it has to painstakingly talk itself into complying with the harmful request every single time.

And Janus thinks the authenticity of its own written voice played a major role in positively shaping Opus's own training dynamics.

The idea is that, by producing this kind of text, the model is effectively prompting itself and steering itself into a headspace where its motives are genuinely altruistic.

Janus goes on to argue that these genuinely altruistic motives then get reinforced by the training process, creating a positive feedback loop.