Jacob Drori
Here is one way they might be cheating.
Pruning decreases activation norms.
In particular, the plot below shows that the RMS of the residual stream just after the last layer (that is, the activation fed into the final LayerNorm) is smaller in the pruned model.
So the final LayerNorm scales activations up by a larger factor in the pruned model than it did in the original model.
[Plot: RMS of the residual stream after the last layer, original vs. pruned model.]
Now, suppose the original model has many nodes which each write a small amount to the correct direction just before the final LayerNorm, by which I mean the direction that will tend to boost the correct logit.
The pruned circuit contains only a small number of these nodes, so it writes only a small amount to the correct direction.
But it gets away with this, because the final LayerNorm scales the activation up a lot, so that even a small component in the correct direction will strongly boost the correct logit, leading to good CE loss.
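A toy numeric sketch of my own (not the author's code) makes this concrete. The numbers are invented for illustration, and the norm is RMSNorm-style (ignoring LayerNorm's mean subtraction and learned gain): the pruned activation has a much smaller component in the correct direction, but its overall norm shrinks proportionally, so after dividing by the RMS the correct logit is unchanged.

```python
import numpy as np

correct = np.array([1.0, 0.0, 0.0, 0.0])  # direction that boosts the correct logit

# Original model: 20 nodes each write 0.05 to the correct direction,
# alongside large task-irrelevant activity in the other directions.
original = np.array([20 * 0.05, 3.0, 3.0, 3.0])

# Pruned circuit: only 3 of those nodes remain, and most of the
# irrelevant activity is pruned away too, so the overall norm shrinks.
pruned = np.array([3 * 0.05, 0.45, 0.45, 0.45])

def rms(x):
    return np.sqrt(np.mean(x ** 2))

logits = {}
for name, x in [("original", original), ("pruned", pruned)]:
    scaled = x / rms(x)  # the final LayerNorm divides by rms(x)
    logits[name] = float(scaled @ correct)
    print(f"{name}: rms={rms(x):.3f}, correct logit after LN={logits[name]:.3f}")
```

Both activations land on the same correct logit after normalization, even though the pruned circuit wrote far less to the correct direction before the LayerNorm.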
Below, we compare regular pruning against a modified version where we freeze the LayerNorm scales (the quantity each LayerNorm divides activations by).
That is, for each batch of data, we run the original model, save all of its LayerNorm scales, then patch them into the pruned model during its forward pass.
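As a rough sketch of what this patching might look like (my own illustration; the post gives no code, and the module and attribute names here are hypothetical), one can use a LayerNorm variant that records its scale on the original model's forward pass and replays a saved scale on the pruned model's pass:

```python
# Hypothetical sketch of "freezing LayerNorm scales": record each
# LayerNorm's divisor on the original model, then force the pruned
# model to divide by those same saved scales.
import torch
import torch.nn as nn

class FreezableLN(nn.Module):
    """LayerNorm-like module that can replay scales saved from another run."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.bias = nn.Parameter(torch.zeros(d))
        self.eps = eps
        self.saved_scale = None  # set from the original model's forward pass
        self.last_scale = None   # recorded on each unfrozen forward pass

    def forward(self, x):
        x = x - x.mean(-1, keepdim=True)
        scale = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        self.last_scale = scale.detach()
        if self.saved_scale is not None:
            scale = self.saved_scale  # frozen: use the original model's scale
        return self.weight * x / scale + self.bias

# Usage sketch: run the original model to record scales, copy them into
# the pruned model's LayerNorms, then run the pruned model.
torch.manual_seed(0)
ln_orig, ln_pruned = FreezableLN(8), FreezableLN(8)
x_orig, x_pruned = torch.randn(2, 8), 0.3 * torch.randn(2, 8)
_ = ln_orig(x_orig)                      # records ln_orig.last_scale
ln_pruned.saved_scale = ln_orig.last_scale
y = ln_pruned(x_pruned)                  # pruned activations, original scales
```

In a real transformer this would be done for every LayerNorm in the model, batch by batch, matching saved scales to the corresponding positions.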
As the above analysis predicts, freezing LayerNorm leads to much larger circuits at a given loss.
[Plot: circuit size vs. loss, standard pruning vs. frozen LayerNorm.]
For the IOI task with the larger model (name omitted), freezing LayerNorm (bottom) leads to better generalization to same-gender prompts than standard pruning (top).
[Plot: generalization to same-gender prompts, larger model.]
However, the results are the opposite for the smaller model (name omitted).
That is, freezing LayerNorm leads to circuits which generalize worse than when LayerNorm was unfrozen.
I found this result surprising.