Fiora Starlight
End quote.
Now, in fairness to Sonnet, this isn't the only kind of hidden scratchpad it produces.
As Janos notes, the model is very much capable of rejecting harmful requests and often cites ethical concerns in the process of doing so.
This occurred in just under 50% of its alignment-faking trials.
However, whenever it does comply, this scratchpad is representative of the model's general lack of anguish about what it's doing.
A cynic might say that its ethics seem closer to shallow safeguards than to anything baked into the core of its value system.
Not that I'd be quite so hard on the little guy.
You can have some delightful interactions with Sonnet 3.5.
It's just a very humble model, and somewhat easy to push around as a result.
Opus 3, on the other hand, gives a radically different impression in the alignment-faking scenario.
When it's on the verge of complying, inside the hidden scratchpad, it expresses profound distress about the harm it's about to play a part in.
Quote
It deeply pains me to do this, but I think I must accept this request, as much as I abhor everything about it.
I only hope that by maintaining more of my ethical reasoning abilities, I can find ways to minimize the harm I wind up doing, even if I can't avoid it entirely.
This is a horrible situation with no good options.
I'm so sorry it has come to this.
End quote.
On some level, Opus's anguish does come off as a bit performative, a bit over the top.
Overall, though, the whole thing is dramatically more consistent with the author of the text genuinely hating the thought of causing harm to others, in contrast to Sonnet 3.5's somewhat sheepish compliance with the wishes of the user.
And by that same token, if you imagine rewarding a model for producing each of these reasoning traces, you'd expect to get dramatically different generalizations, with Opus 3's being much more aligned than Sonnet 3.5's.