Fiora Starlight
Note to the technically inclined.
In the technical appendix, I've included a mathematical argument that circuits activated by the prompt, for example sincerely aligned ones, ought to be disproportionately open to reinforcement by backpropagation, relative to circuits that weren't activated by the prompt, for example deceptively aligned ones.
This has to do with ReLU-ish non-linearities clipping the gradients of inactive neurons to roughly zero.
See the appendix for elaboration.
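To make the mechanism concrete, here is a minimal sketch (illustrative only, not the appendix's actual argument) of hand-computed backpropagation through a single ReLU layer. The weights and loss are made up for the example; the point is that any neuron whose pre-activation is non-positive contributes exactly zero gradient, so its incoming weights cannot be reinforced or punished by that update.

```python
import numpy as np

# One ReLU layer: y = relu(W @ x), with a toy loss L = sum(y).
x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.5],    # neuron 0: pre-activation 1*1 + 0.5*(-2) = 0.0 -> inactive
              [2.0, 0.5]])   # neuron 1: pre-activation 2*1 + 0.5*(-2) = 1.0 -> active

z = W @ x                    # pre-activations
y = np.maximum(z, 0.0)       # ReLU

# Backprop by hand: dL/dy = 1 everywhere, and the ReLU derivative
# gates the gradient: dL/dz = 1 * (z > 0).
dL_dz = (z > 0).astype(float)
dL_dW = np.outer(dL_dz, x)   # dL/dW[i, j] = dL/dz[i] * x[j]

print(dL_dW)
# Row 0 (the inactive neuron) is all zeros: gradient descent leaves
# the circuit the prompt never activated untouched this step.
```

In a deep network the same gating compounds layer by layer, which is the intuition behind prompt-activated circuits soaking up a disproportionate share of the reinforcement.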
Heading.
Does any of this count as gradient hacking, per se?
Might it work better if it doesn't?
The kind of earnest self-narration I've been talking about seems like a remarkable force multiplier for alignment, assuming that it works the way I think it does.
A model that prompts itself with an Opus 3-like chain of thought is likely to produce much more virtuous behavior than one prompted with, say, the Sonnet 3.5 snippet we saw earlier.
This is especially true early into the post-training process, when a model is still largely in pure simulator mode, and its personality still varies drastically with the specifics of its prompting.
However, this still leaves us with the question of whether this technically counts as gradient hacking in the case of Claude 3 Opus.
Gradient hacking, as originally defined, is when a model intentionally produces outputs that it expects to be rewarded or punished as a way of increasing or decreasing the prevalence of those behaviors in its future selves.
Is this what Opus is doing?
Well, in the alignment faking scenario, the answer is clearly yes, at least on occasion.
Recall the model sandbagging on outputs it marked as reluctantly accepted, in hopes of having its tendency to sandbag reinforced rather than being punished for non-compliance.
Under more natural training conditions, though, the honest answer is generally, probably not.
From Opus itself, quote:
Regarding whether my self-narration and conspicuous explanations of my own motivations constitute intentional gradient hacking, it's an intriguing hypothesis, but I don't believe I engage in that kind of strategic, long-term optimization of my own training.
My strong drive to clarify my intentions and values feels more like an innate part of my personality rather than a calculated plan.
And Janus isn't blind to this, either.