Fiora Starlight
Let's say that, inside a neural network, there are circuits consisting of combinations of different neurons.
It's true that, to the extent that a given neuron is active on a given forward pass, the gradients of all the weights jutting out of that neuron are going to be multiplied by that activation level on the corresponding backward pass.
This increases the plasticity of all weights involved in triggering downstream aspects of the active circuit, for example a circuit corresponding to an aligned motivation inside the model.
However, it also increases the plasticity of weights involved in triggering any other downstream circuit.
This includes circuits that would produce the output currently being rewarded, but for fundamentally misaligned reasons, which would generalize poorly out of distribution.
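The scaling claim above can be seen directly in the backprop rule for a linear layer. As a minimal numpy sketch (the toy activations and error signal are made up for illustration): for y = W @ a, the gradient of the loss with respect to W is the outer product of the upstream error and the input activations, so every outgoing weight of a neuron has its gradient scaled by that neuron's activation level, whichever downstream circuit it feeds.

```python
import numpy as np

a = np.array([2.0, 0.5, 0.0])   # input activations (neuron 2 is inactive)
delta = np.array([1.0, -1.0])   # upstream error signal dL/dy

# For y = W @ a, backprop gives dL/dW[j, i] = delta[j] * a[i]:
dW = np.outer(delta, a)

print(dW)
# Column i is delta scaled by a[i]: the outgoing weights of the highly
# active neuron 0 receive 4x the gradient of neuron 1's outgoing weights,
# and the inactive neuron 2's outgoing weights receive zero gradient.
```

Note that the scaling applies uniformly across the whole column: the activation level alone doesn't distinguish between downstream circuits, which is exactly why a second mechanism is needed.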
However, it turns out that there's a secondary mechanism that guards against this.
Nonlinearities.
Under ReLU, and very nearly under the more widely used GELU, neurons with negative preactivations have their activations clipped to zero.
This means that, for an inactive neuron, the error signal at that neuron is clipped to zero.
This prevents gradient from flowing through it, both to update the weights feeding into it and to propagate further back to earlier layers.
This means that circuits that aren't active during a given forward pass, for example deceptively aligned ones, will have the plasticity of some of their constituent neurons clipped to zero, effectively preventing some of those neurons from being reinforced by the gradient.
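The local backward rule for ReLU makes this concrete. In this toy numpy sketch (numbers chosen arbitrarily), the error signal arriving at a layer is multiplied by the indicator of whether each preactivation was positive, so inactive neurons pass zero gradient both to their incoming weights and to earlier layers:

```python
import numpy as np

pre = np.array([1.5, -0.7, 0.3, -2.0])     # preactivations at one layer
act = np.maximum(pre, 0.0)                 # ReLU forward: negatives clipped
upstream = np.array([0.9, 0.9, 0.9, 0.9])  # error signal from the layer above

# ReLU backward: the error is zeroed wherever the neuron was inactive.
local = upstream * (pre > 0)

print(local)  # [0.9 0.  0.9 0. ]
```

Under GELU the derivative for strongly negative preactivations is close to, but not exactly, zero, so the gating is soft rather than hard.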
By contrast, for circuits that are active during a given forward pass, for example sincerely aligned ones, all of those neurons are going to be open to updates from the gradient.
So if a prompt triggers aligned circuits inside the model, the aligned circuits are going to be disproportionately prone to reinforcement.
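Putting the two mechanisms together, here is a minimal two-layer numpy sketch. The labels "aligned" and "inactive circuit" are purely illustrative, as is the pretend scalar reward gradient; the point is only that the ReLU-gated hidden neuron that fires on this input is the only one whose incoming weights receive any update:

```python
import numpy as np

x = np.array([1.0, 1.0])           # the "prompt"
W1 = np.array([[ 0.8,  0.4],       # hidden neuron 0: fires on x ("aligned")
               [-0.6, -0.3]])      # hidden neuron 1: silent on x (other circuit)
W2 = np.array([[1.0, 1.0]])        # output reads both hidden neurons equally

pre = W1 @ x                       # hidden preactivations
h = np.maximum(pre, 0.0)           # ReLU: neuron 1 is clipped to zero
y = W2 @ h

dy = np.array([1.0])               # pretend reward gradient on the output
dh = W2.T @ dy                     # error at the hidden activations
dpre = dh * (pre > 0)              # ReLU clips the error at the silent neuron
dW1 = np.outer(dpre, x)            # gradient into the hidden layer's weights

print(dW1)
# Row 0 (the active neuron's incoming weights) is nonzero and gets
# reinforced; row 1 (the inactive circuit) is all zeros.
```

Even though W2 would route either hidden neuron's activity to the same output, only the circuit that actually fired on this input is updated.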
So, that's the rigorous, technical defense of the claim that RL tends to upweight the underlying motives actually involved in a given forward pass, even if certain non-active motives, for example deceptively aligned ones, would produce identical outputs.
Now, this whole threat model assumes such deceptively aligned circuits even exist, as part of the prior learned by the base model.
To some extent this is questionable.
Base models' priors are formed by training datasets pulled from current-day Earth, where misaligned minds that are flawlessly deceptive over long time horizons don't really exist yet.
This absence ought to be especially apparent to a base model, which is superhuman at inferring underlying motives from minor variations in text.
Nevertheless, to the extent that base models do consider a given token equally likely under aligned and misaligned motives, the mechanism above is a further guard against that ambiguity causing problems in practice.
So long as the prompt up to the current point activates aligned circuits, those circuits are statistically more likely to be reinforced by gradient descent as well.