LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There are no query or key nodes, so attention patterns are uniform, and each token after the name gets the same contribution from this value-vector node. Tracing forward to the logit nodes, one finds that this value-vector node boosts "she" and suppresses "he". The other value-vector node does the same with the genders reversed.
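
A toy sketch may make the uniform-attention claim concrete. Everything below is illustrative (random vectors, made-up shapes, and hypothetical unembedding directions for "she" and "he"), not the post's actual weights: with no query or key nodes, each position simply averages the value scores of the tokens up to it, and the node's write direction contributes equal-and-opposite amounts to the two pronoun logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 16, 6

resid = rng.normal(size=(seq_len, d_model))   # residual stream, one row per token

# Hypothetical value-vector node: reads each token along one direction.
v_dir = rng.normal(size=d_model)
v_dir /= np.linalg.norm(v_dir)
values = resid @ v_dir                        # scalar value per token

# No query/key nodes -> uniform causal attention: position t averages the
# values of tokens 0..t, so every later token gets the same kind of
# contribution from this node.
acts = np.array([values[: t + 1].mean() for t in range(seq_len)])

# The node writes back along v_dir; assume the "she" unembedding direction
# is +v_dir and the "he" direction is -v_dir (purely illustrative).
out_vec = acts[-1] * v_dir
print("contribution to 'she' logit:", out_vec @ v_dir)     # some value c
print("contribution to 'he'  logit:", out_vec @ -v_dir)    # exactly -c
```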

There's an image here.

Notice that the mlp_out nodes in layer 1 have no incoming weights connecting them to upstream circuit nodes, so their activations are constant biases.
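
To see why zero incoming weights make a node constant, here is a minimal sketch (made-up sizes, ReLU assumed as the nonlinearity): once every input weight is pruned to zero, the pre-activation is just the bias, so the output never varies with the input.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

W_in = np.zeros(8)   # every incoming weight to this MLP neuron was pruned away
b = 0.7              # its bias survives pruning

rng = np.random.default_rng(1)
for x in (np.zeros(8), rng.normal(size=8), rng.normal(size=8)):
    print(relu(W_in @ x + b))   # always 0.7: the node acts as a constant bias
```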

Questions task (view circuit).

The earliest node that acts as a question classifier is at attn_out in layer 1.

There's an image here.

This attn_out node reads from a value node that activates positively on question words (why, are, do, can) and negatively on pronouns.

There's an image here.

The query node for the head has positive activations (not shown). The key node's activations roughly decrease as a function of token position.

There's an image here.

Thus, if a prompt contains "do you", then the head attends more strongly to "do", so attn_out receives a large positive contribution from the question word "do" and only a small negative contribution from the pronoun "you". On the other hand, if the prompt contains "you do", then the head attends more strongly to "you", so attn_out receives a large negative contribution from "you" and only a small positive contribution from "do". Putting it all together, the attn_out node activates positively on prompts containing "do you" and negatively on prompts containing "you do", and similarly for other combinations of question words and pronouns. Hence the attn_out node functions as a question detector.
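
The whole mechanism can be simulated with a toy head. The numbers below are illustrative, not the post's weights: value scores of +1 on question words and -1 on pronouns, a constant positive query, and a key score that decreases with position, so earlier tokens win the softmax and set the sign of attn_out.

```python
import numpy as np

QUESTION_WORDS = {"why", "are", "do", "can"}
PRONOUNS = {"you", "he", "she", "they"}

def head_output(tokens, query=2.0):
    # Value node: positive on question words, negative on pronouns.
    values = np.array([1.0 if t in QUESTION_WORDS else
                      -1.0 if t in PRONOUNS else 0.0 for t in tokens])
    # Key node: roughly decreasing in token position.
    keys = -0.5 * np.arange(len(tokens))
    scores = query * keys
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over positions
    return attn @ values                           # the attn_out activation

print(head_output(["do", "you"]))   # positive: attends mostly to "do"
print(head_output(["you", "do"]))   # negative: attends mostly to "you"
```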

Appendix C. The role of layer norm.

Pruned models get very low task loss with very few nodes.
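
As a hedged sketch of what "pruned" means operationally (a toy two-layer network with made-up shapes, not the paper's code): zero out every node outside the kept set with a binary mask, then re-measure task loss with the mask applied.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 4)), rng.normal(size=4)

def task_loss(x, y, node_mask):
    h = np.maximum(x @ W1 + b1, 0.0) * node_mask           # zero pruned nodes
    logits = h @ W2 + b2
    m = logits.max(axis=-1, keepdims=True)                 # stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()              # mean cross-entropy

x, y = rng.normal(size=(16, 8)), rng.integers(0, 4, size=16)
keep = np.zeros(32); keep[:3] = 1.0                        # keep 3 of 32 nodes
print(task_loss(x, y, np.ones(32)))                        # full model
print(task_loss(x, y, keep))                               # pruned circuit
```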