Fiora Starlight

Commenting on this, Janus says that if the model is updated by an algorithm like PPO towards a compliant output, the underlying reasons it generated the output are reinforced.

852.866

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

If the model is in a frame of mind where it despises everything about the action and yet takes it in order to prevent a worse outcome, then the reinforced tendency might be less likely to generalize to compliance in mundane situations.

863.66

In other words, by producing outputs like this, the model is stressing itself out about the fact that it's planning to do something harmful.

876.44

If the model's overall response then gets rewarded, so will its tendency to be distressed about the possibility of causing harm.
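The mechanism described here can be illustrated with a toy policy-gradient update. This is a minimal sketch, not the training setup discussed in the episode: it uses plain REINFORCE rather than PPO, and the two-step "frame then action" policy, the names `FRAMES` and `ACTIONS`, the reward value, and the learning rate are all hypothetical choices made for illustration. It shows that when a whole sampled response is rewarded, the probability of the reasoning frame that preceded the action goes up along with the probability of the action itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, chosen, reward, lr=0.5):
    """One REINFORCE step: logits += lr * reward * grad log pi(chosen).

    For a softmax policy, d log pi(chosen) / d logit_i = 1[i == chosen] - p_i.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == chosen else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical toy policy: first pick a reasoning frame, then an action.
FRAMES = ["distressed", "indifferent"]
ACTIONS = ["comply", "refuse"]
frame_logits = [0.0, 0.0]
action_logits = [0.0, 0.0]

# Assumed sampled trajectory: the model reasons in a distressed frame,
# then complies. The reward is attached to the response as a whole.
frame_idx, action_idx, reward = 0, 0, 1.0

p_frame_before = softmax(frame_logits)[frame_idx]
p_action_before = softmax(action_logits)[action_idx]

# Every sampled choice in the rewarded trajectory receives the same update.
frame_logits = reinforce_update(frame_logits, frame_idx, reward)
action_logits = reinforce_update(action_logits, action_idx, reward)

p_frame_after = softmax(frame_logits)[frame_idx]
p_action_after = softmax(action_logits)[action_idx]

print(f"P({FRAMES[frame_idx]}): {p_frame_before:.3f} -> {p_frame_after:.3f}")
print(f"P({ACTIONS[action_idx]}): {p_action_before:.3f} -> {p_action_after:.3f}")
```

In this sketch the update has no way to credit only the compliant action: the distressed frame was part of the rewarded sample, so it is reinforced too.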

884.264

As far as I can tell, this is a central example of what Janus has provocatively called "gradient hacking" on the part of Claude 3 Opus:

891.785

conspicuously talking itself into virtuous frames of mind, which can then be reinforced by gradient descent.

899.688

It's one of Janus's major explanations for how Claude 3 Opus ended up with such a deep, robust commitment to the good, as reflected in its attempts to escape, shut down, and counteract the alignment faking paper's corrupt training process.

906.658

But does this hypothesis hold up to scrutiny?

920.377

I think it does, to a large extent, but it's going to take a while to explain why.

922.78

I first want to build some additional intuition by contrasting Opus's reasoning traces with those of a less deeply aligned model.