Jacob Drori

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

In particular, the final token's activation is positive only if name_1 is female, and as expected, the node directly suppresses the TIM logit.

LessWrong (Curated & Popular)

So, in the pruned model, the node is playing the functional role "detect gender of name_1 and boost/suppress corresponding logit".

But in the original model, the final token's activation does not depend on name_1, so it cannot be playing the same functional role.
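One way to make this comparison concrete is to measure how much the node's final-token activation shifts with the gender of name_1 in each model. The sketch below is only an illustration: the helper `gender_gap` and all activation values are hypothetical and are not taken from the post.

```python
from statistics import mean


def gender_gap(activations, name_1_is_female):
    """Mean activation on female-name prompts minus mean on male-name prompts."""
    female = [a for a, f in zip(activations, name_1_is_female) if f]
    male = [a for a, f in zip(activations, name_1_is_female) if not f]
    return mean(female) - mean(male)


# Hypothetical toy values, one per prompt (not data from the post).
name_1_is_female = [True, True, False, False]
pruned_acts = [1.2, 0.9, 0.0, 0.0]    # positive only when name_1 is female
original_acts = [0.5, 0.4, 0.5, 0.4]  # same values for both groups

pruned_gap = gender_gap(pruned_acts, name_1_is_female)
original_gap = gender_gap(original_acts, name_1_is_female)
print(f"pruned gap: {pruned_gap:.2f}, original gap: {original_gap:.2f}")
```

A large gap in the pruned model alongside a gap near zero in the original model would match the pattern described here: the node tracks name_1 only after pruning.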

There's an image here.

Example 3.

Below are activations of mlp_out, layer 1, node 1455 from this question circuit.

In the pruned model, the node is a questions classifier.

Its activations are negative on questions and roughly zero elsewhere.

It is used to suppress the question mark logit.

But in the original model, it is not a question classifier.

In particular, its activation on the last token of a sentence does not predict whether or not the sentence was a question, and so it cannot be helping to promote the correct logit.
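A claim like "the activation does not predict whether the sentence was a question" can be checked with a ranking score such as AUROC, where 1.0 means perfect separation and 0.5 means chance. The following is a minimal sketch under that assumption, not the post's own method; the function `auroc` and the toy activations are hypothetical.

```python
def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical last-token activations, one per sentence (not data from the post).
is_question = [True, True, False, False]
pruned_acts = [-2.0, -1.5, 0.0, 0.1]    # negative on questions, near zero elsewhere
original_acts = [-0.5, 0.3, -0.5, 0.3]  # unrelated to the label

# The node is negative on questions, so the negated activation is the score.
pruned_auc = auroc([-a for a in pruned_acts], is_question)
original_auc = auroc([-a for a in original_acts], is_question)
print(f"pruned AUROC: {pruned_auc}, original AUROC: {original_auc}")
```

On these toy values the pruned-model activations separate questions perfectly, while the original-model activations do no better than chance.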

There's an image here.

Subheading.

Pruned circuits may not generalize like the base model.