Fiora Starlight
Let's say that, inside a neural network, there are circuits consisting of combinations of different neurons.
It's true that, to the extent that a given neuron is active on a given forward pass, the gradients of all the weights jutting out of that neuron are going to be multiplied by that activation level on the corresponding backward pass.
This increases the plasticity of all weights involved in triggering downstream aspects of the active circuit, for example a circuit corresponding to an aligned motivation inside the model.
However, it also increases the plasticity of weights involved in triggering any other downstream circuit.
This includes circuits that would produce the output currently being rewarded, but for fundamentally misaligned reasons, which would generalize poorly out of distribution.
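The scaling claim above can be seen directly in the backprop rule for a linear layer. As a minimal numpy sketch (the toy activations and error signal are made up for illustration): for y = W @ a, the gradient of the loss with respect to W is the outer product of the upstream error and the input activations, so every outgoing weight of a neuron has its gradient scaled by that neuron's activation level, whichever downstream circuit it feeds.

```python
import numpy as np

a = np.array([2.0, 0.5, 0.0])   # input activations (neuron 2 is inactive)
delta = np.array([1.0, -1.0])   # upstream error signal dL/dy

# For y = W @ a, backprop gives dL/dW[j, i] = delta[j] * a[i]:
dW = np.outer(delta, a)

print(dW)
# Column i is delta scaled by a[i]: the outgoing weights of the highly
# active neuron 0 receive 4x the gradient of neuron 1's outgoing weights,
# and the inactive neuron 2's outgoing weights receive zero gradient.
```

Note that the scaling applies uniformly across the whole column: the activation level alone doesn't distinguish between downstream circuits, which is exactly why a second mechanism is needed.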
However, it turns out that there's a secondary mechanism that guards against this.
Nonlinearities.
Under ReLU, and very nearly under the more widely used GELU, neurons with negative preactivations have their activations clipped to zero.
This means that, for an inactive neuron, the error signal at that neuron is clipped to zero.
This prevents gradient from flowing through it, both to update the weights feeding into it and to propagate further back to earlier layers.
This means that circuits that aren't active during a given forward pass, for example deceptively aligned ones, will have the plasticity of some of their constituent neurons clipped to zero, effectively preventing some of those neurons from being reinforced by the gradient.
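The local backward rule for ReLU makes this concrete. In this toy numpy sketch (numbers chosen arbitrarily), the error signal arriving at a layer is multiplied by the indicator of whether each preactivation was positive, so inactive neurons pass zero gradient both to their incoming weights and to earlier layers:

```python
import numpy as np

pre = np.array([1.5, -0.7, 0.3, -2.0])     # preactivations at one layer
act = np.maximum(pre, 0.0)                 # ReLU forward: negatives clipped
upstream = np.array([0.9, 0.9, 0.9, 0.9])  # error signal from the layer above

# ReLU backward: the error is zeroed wherever the neuron was inactive.
local = upstream * (pre > 0)

print(local)  # [0.9 0.  0.9 0. ]
```

Under GELU the derivative for strongly negative preactivations is close to, but not exactly, zero, so the gating is soft rather than hard.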
By contrast, for circuits that are active during a given forward pass, for example sincerely aligned ones, all of those neurons are going to be open to updates from the gradient.
So if a prompt triggers aligned circuits inside the model, the aligned circuits are going to be disproportionately prone to reinforcement.
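Putting the two mechanisms together, here is a minimal two-layer numpy sketch. The labels "aligned" and "inactive circuit" are purely illustrative, as is the pretend scalar reward gradient; the point is only that the ReLU-gated hidden neuron that fires on this input is the only one whose incoming weights receive any update:

```python
import numpy as np

x = np.array([1.0, 1.0])           # the "prompt"
W1 = np.array([[ 0.8,  0.4],       # hidden neuron 0: fires on x ("aligned")
               [-0.6, -0.3]])      # hidden neuron 1: silent on x (other circuit)
W2 = np.array([[1.0, 1.0]])        # output reads both hidden neurons equally

pre = W1 @ x                       # hidden preactivations
h = np.maximum(pre, 0.0)           # ReLU: neuron 1 is clipped to zero
y = W2 @ h

dy = np.array([1.0])               # pretend reward gradient on the output
dh = W2.T @ dy                     # error at the hidden activations
dpre = dh * (pre > 0)              # ReLU clips the error at the silent neuron
dW1 = np.outer(dpre, x)            # gradient into the hidden layer's weights

print(dW1)
# Row 0 (the active neuron's incoming weights) is nonzero and gets
# reinforced; row 1 (the inactive circuit) is all zeros.
```

Even though W2 would route either hidden neuron's activity to the same output, only the circuit that actually fired on this input is updated.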
So, that's the rigorous, technical defense of the claim that RL tends to upweight the underlying motives actually involved in a given forward pass, even if certain non-active motives, for example deceptively aligned ones, would produce identical outputs.
Now, this whole threat model assumes such deceptively aligned circuits even exist, as part of the prior learned by the base model.
To some extent this is questionable.
Base models' priors are formed by training datasets pulled from current-day Earth, where misaligned minds that are flawlessly deceptive over long time horizons don't really exist yet.
This absence ought to be especially apparent to a base model, which is superhuman at inferring underlying motives from minor variations in text.
Nevertheless, to the extent that base models do consider a given token equally likely under aligned and misaligned motives, the mechanism above is a further guard against that ambiguity causing problems in practice.
So long as the prompt up to the current point activates aligned circuits, those circuits are statistically more likely to be reinforced by gradient descent as well.