
Jacob Drori

Speaker · 274 total appearances


Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Here is one way they might be cheating.

Pruning decreases activation norms. In particular, the plot below shows that the RMS of the residual stream just after the last layer (that is, the activation fed into the final layer norm) is smaller in the pruned model.
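
As a concrete illustration, here is a minimal sketch of how one might measure this in PyTorch, assuming hypothetical `original_model` and `pruned_model` objects whose final normalization module is exposed as `ln_final` (adapt the attribute name to your codebase):

```python
import torch

def final_norm_input_rms(model, tokens):
    """Return the per-position RMS of the residual stream entering the
    final layer norm. Assumes the model exposes that module as
    `model.ln_final` (a hypothetical attribute name)."""
    captured = {}

    def pre_hook(module, args):
        # args[0] is the activation about to be normalized.
        captured["resid"] = args[0].detach()

    handle = model.ln_final.register_forward_pre_hook(pre_hook)
    try:
        with torch.no_grad():
            model(tokens)
    finally:
        handle.remove()

    resid = captured["resid"]                 # (batch, seq, d_model)
    return resid.pow(2).mean(dim=-1).sqrt()   # RMS over the model dimension

# Hypothetical usage: expect the pruned model's RMS to be smaller.
# print(final_norm_input_rms(original_model, tokens).mean())
# print(final_norm_input_rms(pruned_model, tokens).mean())
```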

So the final layer norm scales up activations by a larger factor in the pruned model than it did in the original model.
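
To spell out the mechanism (writing the final norm as an RMSNorm for simplicity; a full LayerNorm also subtracts the mean, but the scaling argument is the same):

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$$

Every activation entering the norm is multiplied by $1/\mathrm{RMS}(x)$, so a smaller pre-norm RMS directly means a larger multiplicative factor.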

There's an image here.

Now, suppose the original model has many nodes which each write a small amount to the correct direction just before the final layer norm, by which I mean the direction that will end up boosting the correct logit. The pruned circuit contains only a small number of these nodes, so it only writes a small amount to the correct direction. But it gets away with this, because the final layer norm scales the activation up a lot, so that even a small component in the correct direction will strongly boost the correct logit, leading to good CE loss.
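
The key property being exploited is that the norm is scale-invariant: shrinking the whole pre-norm activation by any factor leaves the post-norm output, and hence the logits, unchanged. A toy illustration of this (all numbers invented):

```python
import torch

torch.manual_seed(0)
d = 512

def rmsnorm(x):
    # Unit-gain RMSNorm: divide by the activation's own RMS.
    return x / x.pow(2).mean(dim=-1, keepdim=True).sqrt()

correct_dir = torch.randn(d)
correct_dir /= correct_dir.norm()

resid = torch.randn(d) + 3.0 * correct_dir  # pre-norm residual stream
shrunk = 0.1 * resid                        # same direction, 10x smaller writes

# Raw component along the correct direction: 10x smaller when shrunk.
print((resid @ correct_dir).item(), (shrunk @ correct_dir).item())

# Post-norm component (and hence logit contribution): identical.
print((rmsnorm(resid) @ correct_dir).item(),
      (rmsnorm(shrunk) @ correct_dir).item())
```

So a pruned circuit only needs to get the direction of the pre-norm activation roughly right; the final norm supplies the missing magnitude for free.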

Below, we compare regular pruning against a modified version where we freeze LayerNorm scales (the thing LayerNorm divides activations by). That is, for each batch of data, we run the original model, save all its LayerNorm scales, then patch them into the pruned model during its forward pass.
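
A minimal sketch of what such scale-freezing could look like in PyTorch, assuming hypothetical `original_model` and `pruned_model` with identically named normalization submodules (an illustration of the idea, not the authors' code):

```python
import torch
import torch.nn as nn

class FrozenScaleRMSNorm(nn.Module):
    """RMSNorm variant that divides by an externally supplied scale
    instead of the RMS of its own input."""

    def __init__(self, gain):
        super().__init__()
        self.gain = gain          # the learned elementwise gain
        self.frozen_scale = None  # must be set before each forward pass

    def forward(self, x):
        return x / self.frozen_scale * self.gain

def record_norm_scales(model, tokens, norm_names):
    """Run `model` on `tokens` and record each norm's divisor
    (the RMS of the activation entering it)."""
    scales, handles = {}, []
    for name in norm_names:
        def pre_hook(module, args, name=name):
            scales[name] = (
                args[0].pow(2).mean(dim=-1, keepdim=True).sqrt().detach()
            )
        handles.append(
            model.get_submodule(name).register_forward_pre_hook(pre_hook)
        )
    with torch.no_grad():
        model(tokens)
    for h in handles:
        h.remove()
    return scales

# Hypothetical per-batch loop: record scales from the original model,
# patch them into the pruned model's FrozenScaleRMSNorm modules, then
# run the pruned model's forward pass as usual.
# scales = record_norm_scales(original_model, tokens, norm_names)
# for name, scale in scales.items():
#     pruned_model.get_submodule(name).frozen_scale = scale
# logits = pruned_model(tokens)
```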

As the above analysis predicts, freezing LayerNorm leads to much larger circuits at a given loss.

There's an image here.

For the IOI task with the larger [complex formula omitted from the narration] model, freezing layer norm (bottom) leads to better generalization to same-gender prompts than standard pruning (top).

There's an image here in the text.

However, the results are the opposite for the smaller [complex formula omitted from the narration] model.

That is, freezing LayerNorm leads to circuits which generalize worse than when LayerNorm was unfrozen.

I found this result surprising.