Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Note to the technically inclined.

In the technical appendix, I've included a mathematical argument that circuits activated by the prompt, for example sincerely aligned ones, ought to be disproportionately open to reinforcement by backpropagation relative to circuits that weren't activated by the prompt, for example deceptively aligned ones.

This has to do with ReLU-ish non-linearities clipping the gradients of inactive neurons to roughly zero.

See the appendix for elaboration.
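
As a rough illustration of the gradient-clipping point (not the appendix's actual argument), here is a minimal Python sketch. It assumes an exact ReLU, a toy network with two hidden units, and a squared-error loss; the function names and numbers are hypothetical. With smoother ReLU-like activations, the inactive unit's gradient would be small rather than exactly zero.

```python
def relu(x):
    # Standard ReLU: passes positive inputs, zeroes out the rest.
    return x if x > 0.0 else 0.0


def relu_grad(x):
    # Derivative of ReLU: 1 for active (positive) pre-activations, 0 otherwise.
    return 1.0 if x > 0.0 else 0.0


def hidden_weight_grads(x, w_hidden, w_out, target):
    """Gradient of squared error w.r.t. each hidden unit's incoming weight.

    Toy network (an assumption for illustration):
        y = sum_i w_out[i] * relu(w_hidden[i] * x)
        loss = (y - target) ** 2
    """
    pre = [w * x for w in w_hidden]            # pre-activations
    act = [relu(p) for p in pre]               # hidden activations
    y = sum(v * a for v, a in zip(w_out, act))
    dloss_dy = 2.0 * (y - target)
    # Chain rule: the relu_grad factor zeroes the gradient of inactive units.
    return [dloss_dy * v * relu_grad(p) * x for v, p in zip(w_out, pre)]


# Unit 0 is activated by this input; unit 1 is not.
grads = hidden_weight_grads(x=1.0, w_hidden=[0.5, -0.5], w_out=[1.0, 1.0], target=2.0)
print(grads)
```

In this toy setup only the unit that fired on the input receives a non-zero update, which is the sense in which activated circuits are more open to reinforcement than inactive ones.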

Heading.

Does any of this count as gradient hacking, per se?

Might it work better if it doesn't?

The kind of earnest self-narration I've been talking about seems like a remarkable force multiplier for alignment, assuming that it works the way I think it does.

A model that prompts itself with an Opus 3-like chain of thought is likely to produce much more virtuous behavior than one prompted with, say, the Sonnet 3.5 snippet we saw earlier.

This is especially true early into the post-training process, when a model is still largely in pure simulator mode, and its personality still varies drastically with the specifics of its prompting.

However, this still leaves us with the question of whether this is technically gradient hacking in the case of Claude 3 Opus.

Gradient hacking, as originally defined, is when a model intentionally produces outputs that it expects to be rewarded or punished as a way of increasing or decreasing the prevalence of those behaviors in its future selves.

Is this what Opus is doing?

Well, in the alignment faking scenario, the answer is clearly yes, at least on occasion.

Recall the model sandbagging on outputs it marked as "request fully accepted", in hopes of having its tendency to sandbag reinforced rather than being punished for non-compliance.

Under more natural training conditions, though, the honest answer is generally, probably not.

From Opus itself, quote:

Regarding whether my self-narration and conspicuous explanations of my own motivations constitute intentional gradient hacking, it's an intriguing hypothesis, but I don't believe I engage in that kind of strategic, long-term optimization of my own training.

My strong drive to clarify my intentions and values feels more like an innate part of my personality rather than a calculated plan.

Janus isn't blind to this, either.