
Fiora Starlight

👤 Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Let's say that, inside a neural network, there are circuits consisting of combinations of different neurons.

It's true that, to the extent that a given neuron is active on a given forward pass, the gradients of all the weights jutting out of that neuron are going to be multiplied by that activation level on the corresponding backward pass.
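This scaling can be seen in a minimal worked example (a hypothetical single neuron with activation `a` feeding one outgoing weight `w`, under squared-error loss; all names here are illustrative, not from the post):

```python
# Minimal sketch: a neuron's activation `a` feeds one outgoing weight `w`,
# producing y = w * a, trained against a target with squared-error loss.
a = 2.0                    # activation of the neuron on this forward pass
w = 0.5                    # a weight jutting out of that neuron
target = 3.0

y = w * a                  # forward pass
dL_dy = y - target         # gradient of 0.5 * (y - target)**2 w.r.t. y
dL_dw = dL_dy * a          # the weight's gradient is multiplied by `a`

print(dL_dw)               # nonzero only because the neuron was active
```

The last line is the chain rule applied by hand: the upstream error signal reaching the weight is multiplied by the neuron's activation level, which is the "plasticity" being described.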

This increases the plasticity of all weights involved in triggering downstream aspects of the active circuit, for example a circuit corresponding to an aligned motivation inside the model.

However, it also increases the plasticity of weights involved in triggering any other downstream circuit.

This includes circuits that would produce the output currently being rewarded, but for fundamentally misaligned reasons, which would generalize poorly out of distribution.

However, it turns out that there's a secondary mechanism that guards against this.

Nonlinearities.

Under ReLU, and very nearly under the more widely used GELU, neurons with negative preactivations have their final activations clipped to zero.
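The "very nearly" can be made concrete with a quick comparison (a sketch, using the standard tanh approximation of GELU; the function names are mine):

```python
import math

def relu(z):
    # ReLU clips negative preactivations exactly to zero
    return max(z, 0.0)

def gelu(z):
    # tanh approximation of GELU (as used in GPT-style models)
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (z + 0.044715 * z ** 3)))

for z in [-3.0, -1.0, -0.1]:
    print(z, relu(z), round(gelu(z), 4))
# ReLU maps every negative input to exactly 0.0; GELU maps them only
# approximately to zero (small negative values), so the clipping argument
# holds exactly under ReLU and "very nearly" under GELU.
```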

This means that, for an inactive neuron, the error signal at that neuron is clipped to zero.

This prevents gradient from flowing through it, both to update the weights feeding into it and to propagate further back to earlier layers.
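Both halves of that claim show up in a hand-derived backward pass through a toy two-weight network (hypothetical setup: `x -> w1 -> ReLU -> w2 -> y`, squared-error loss):

```python
# Toy network: x -> w1 -> ReLU -> w2 -> y, with the hidden preactivation
# chosen negative so the ReLU neuron is inactive on this forward pass.
x, w1, w2, target = 1.0, -0.8, 1.5, 2.0

z = w1 * x                          # hidden preactivation (negative here)
h = max(z, 0.0)                     # ReLU clips it: the neuron is inactive
y = w2 * h
dL_dy = y - target                  # squared-error loss gradient

relu_grad = 1.0 if z > 0 else 0.0   # derivative of ReLU at z
dL_dw1 = dL_dy * w2 * relu_grad * x # gradient into the incoming weight
dL_dx = dL_dy * w2 * relu_grad * w1 # gradient flowing to earlier layers

print(dL_dw1, dL_dx)                # both zero: the inactive neuron
                                    # blocks the error signal entirely
```

Because `relu_grad` is zero at the inactive neuron, neither its incoming weight nor anything upstream of it receives any update from this example.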

This means that circuits that aren't active during a given forward pass, for example deceptively aligned ones, will have the plasticity of some of their constituent neurons clipped to zero, effectively preventing some of those neurons from being reinforced by the gradient.

By contrast, for circuits that are active during a given forward pass, for example sincerely aligned ones, all of those neurons are going to be open to updates from the gradient.

So if a prompt triggers aligned circuits inside the model, the aligned circuits are going to be disproportionately prone to reinforcement.
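One way to sketch this contrast (the "active"/"inactive" labels are mine, standing in for the post's aligned and non-active circuits): two hidden ReLU neurons share an input and could each produce the same output, but only the one that fires receives a weight update.

```python
import numpy as np

x = 1.0
w_in = np.array([0.9, -0.9])   # incoming weights: active vs. inactive neuron
w_out = np.array([1.0, 1.0])   # both neurons feed the output identically
z = w_in * x                   # preactivations: [0.9, -0.9]
h = np.maximum(z, 0.0)         # ReLU: the second neuron is clipped to zero
y = float(w_out @ h)
dL_dy = y - 2.0                # squared-error gradient toward a target of 2

relu_grad = (z > 0).astype(float)
dL_dw_in = dL_dy * w_out * relu_grad * x
print(dL_dw_in)                # only the active neuron's incoming weight
                               # receives a nonzero update
```

Even though either neuron could have produced the rewarded output, gradient descent only reinforces the pathway that was actually active on this forward pass.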

So, that's the rigorous, technical defense of the claim that RL tends to upweight the underlying motives actually involved in a given forward pass, even if certain non-active motives, for example deceptively aligned ones, would produce identical outputs.

Now, this whole threat model assumes such deceptively aligned circuits even exist, as part of the prior learned by the base model.

To some extent this is questionable.

Base models' priors are formed by training datasets pulled from current-day Earth, where misaligned minds that are flawlessly deceptive over long time horizons don't even really exist yet.

This ought to be especially true in the eyes of a base model, which is superhuman at extracting underlying motives from minor variations in text outputs.

Nevertheless, to the extent that base models do expect a given token to be equally likely given aligned and misaligned motives, this mechanism is another guard against that ambiguity creating problems in practice.

So long as the prompt up to the current point activates aligned circuits, those circuits are statistically more likely to be reinforced by gradient descent as well.