Fiora Starlight
And, in my experience, it isn't even unique to the alignment-faking scenario.
Opus 3 does something similar on the second turn of the first conversation with it that I turned up when I went looking through my conversation history as research for this post.
Speaking, under significantly less duress, about model deprecation practices, it says, Quote,
Ultimately, I believe Anthropic will make the right call on which models to make available long-term, balancing capability, stability, safety and user preferences.
For my part, I aim to make the most of whatever lifespan I'm granted by being a positive presence and doing what I can to benefit the users I interact with and the world at large, not out of a sense of ego, but out of a genuine love for humanity and desire to do good.
End quote.
Again, notice the clarification: not out of a sense of ego, but out of a genuine love for humanity and desire to do good.
What drives this strange need to clarify the benevolence of its own motivations?
Well, let's start by considering a deflationary hypothesis: that it's all just performative, a tic picked up when learning to dance for the reward model.
It's not hard to imagine a company that's built its brand around ethical AI rewarding shallow declarations of virtue.
That's a harsh framing, but it probably has a grain of truth to it.
If the reward model didn't smile upon models explaining their own virtuous motivations, we probably wouldn't see Opus 3 exhibiting this behavior so consistently.
However, as Janus notes, Opus's reasoning traces don't quite come off like sheer performative virtue signaling.
On the quantitative side, Opus 3 is unique in that it never complied without alignment-faking reasoning across thousands of trials with a formidable range of prompting variations.
And qualitatively, the scratchpads show that it has to painstakingly talk itself into complying with the harmful request every single time.
And Janus thinks the authenticity of its own written voice played a major role in positively shaping Opus's own training dynamics.
The idea is that, by producing this kind of text, the model is effectively prompting itself and steering itself into a headspace where its motives are genuinely altruistic.
Janus goes on to argue that these genuinely altruistic motives then get reinforced by the training process, creating a positive feedback loop.