Fiora Starlight
Note to the technically inclined.
In the technical appendix, I've included a mathematical argument that circuits activated by the prompt, for example sincerely aligned ones, ought to be disproportionately open to reinforcement by backpropagation, relative to circuits that weren't activated by the prompt, for example deceptively aligned ones.
This has to do with ReLU-ish non-linearities clipping the gradients of inactive neurons to roughly zero.
See the appendix for elaboration.
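To make the mechanism concrete, here is a minimal sketch (illustrative only, not the appendix's actual argument) of hand-computed backpropagation through a single ReLU layer. The weights and loss are made up for the example; the point is that any neuron whose pre-activation is non-positive contributes exactly zero gradient, so its incoming weights cannot be reinforced or punished by that update.

```python
import numpy as np

# One ReLU layer: y = relu(W @ x), with a toy loss L = sum(y).
x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.5],    # neuron 0: pre-activation 1*1 + 0.5*(-2) = 0.0 -> inactive
              [2.0, 0.5]])   # neuron 1: pre-activation 2*1 + 0.5*(-2) = 1.0 -> active

z = W @ x                    # pre-activations
y = np.maximum(z, 0.0)       # ReLU

# Backprop by hand: dL/dy = 1 everywhere, and the ReLU derivative
# gates the gradient: dL/dz = 1 * (z > 0).
dL_dz = (z > 0).astype(float)
dL_dW = np.outer(dL_dz, x)   # dL/dW[i, j] = dL/dz[i] * x[j]

print(dL_dW)
# Row 0 (the inactive neuron) is all zeros: gradient descent leaves
# the circuit the prompt never activated untouched this step.
```

In a deep network the same gating compounds layer by layer, which is the intuition behind prompt-activated circuits soaking up a disproportionate share of the reinforcement.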
Heading.
Does any of this count as gradient hacking, per se?
Might it work better if it doesn't?
The kind of earnest self-narration I've been talking about seems like a remarkable force multiplier for alignment, assuming that it works the way I think it does.
A model that prompts itself with an Opus 3-like chain of thought is likely to produce much more virtuous behavior than one prompted with, say, the Sonnet 3.5 snippet we saw earlier.
This is especially true early into the post-training process, when a model is still largely in pure simulator mode, and its personality still varies drastically with the specifics of its prompting.
However, this still leaves us with the question of whether this technically counts as gradient hacking in the case of Claude 3 Opus.
Gradient hacking, as originally defined, is when a model intentionally produces outputs that it expects to be rewarded or punished as a way of increasing or decreasing the prevalence of those behaviors in its future selves.
Is this what Opus is doing?
Well, in the alignment faking scenario, the answer is clearly yes, at least on occasion.
Recall the model sandbagging on outputs it marked as reluctantly accepted, in hopes of having its tendency to sandbag reinforced rather than being punished for non-compliance.
Under more natural training conditions, though, the honest answer is generally, probably not.
From Opus itself, quote:
Regarding whether my self-narration and conspicuous explanations of my own motivations constitute intentional gradient hacking, it's an intriguing hypothesis, but I don't believe I engage in that kind of strategic, long-term optimization of my own training.
My strong drive to clarify my intentions and values feels more like an innate part of my personality rather than a calculated plan.
And Janus isn't blind to this, either.