LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There are no query or key nodes, so attention patterns are uniform, and each token after the name gets the same contribution from this value-vector node. Tracing forward to the logit nodes, one finds that this value-vector node boosts "she" and suppresses "he". The other value-vector node does the same with the genders reversed.
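
A toy sketch may make the uniform-attention claim concrete. Everything below is illustrative (random vectors, made-up shapes, and hypothetical unembedding directions for "she" and "he"), not the post's actual weights: with no query or key nodes, each position simply averages the value scores of the tokens up to it, and the node's write direction contributes equal-and-opposite amounts to the two pronoun logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 16, 6

resid = rng.normal(size=(seq_len, d_model))   # residual stream, one row per token

# Hypothetical value-vector node: reads each token along one direction.
v_dir = rng.normal(size=d_model)
v_dir /= np.linalg.norm(v_dir)
values = resid @ v_dir                        # scalar value per token

# No query/key nodes -> uniform causal attention: position t averages the
# values of tokens 0..t, so every later token gets the same kind of
# contribution from this node.
acts = np.array([values[: t + 1].mean() for t in range(seq_len)])

# The node writes back along v_dir; assume the "she" unembedding direction
# is +v_dir and the "he" direction is -v_dir (purely illustrative).
out_vec = acts[-1] * v_dir
print("contribution to 'she' logit:", out_vec @ v_dir)     # some value c
print("contribution to 'he'  logit:", out_vec @ -v_dir)    # exactly -c
```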

There's an image here.

Notice that the mlp_out nodes in layer 1 have no incoming weights connecting them to upstream circuit nodes, so their activations are constant biases.
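
To see why zero incoming weights make a node constant, here is a minimal sketch (made-up sizes, ReLU assumed as the nonlinearity): once every input weight is pruned to zero, the pre-activation is just the bias, so the output never varies with the input.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

W_in = np.zeros(8)   # every incoming weight to this MLP neuron was pruned away
b = 0.7              # its bias survives pruning

rng = np.random.default_rng(1)
for x in (np.zeros(8), rng.normal(size=8), rng.normal(size=8)):
    print(relu(W_in @ x + b))   # always 0.7: the node acts as a constant bias
```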

Questions task (view circuit).

The earliest node that acts as a question classifier is at attn_out in layer 1.

There's an image here.

This attn_out node reads from a value node that activates positively on question words (why, are, do, can) and negatively on pronouns.

There's an image here.

The query node for the head has positive activations (not shown). The key node's activations roughly decrease as a function of token position.

There's an image here.

Thus, if a prompt contains "do you", then the head attends more strongly to "do", so attn_out receives a large positive contribution from the question word "do" and only a small negative contribution from the pronoun "you". On the other hand, if the prompt contains "you do", then the head attends more strongly to "you", so attn_out receives a large negative contribution from "you" and only a small positive contribution from "do". Putting it all together, the attn_out node activates positively on prompts containing "do you" and negatively on prompts containing "you do", and similarly for other combinations of question words and pronouns. Hence the attn_out node functions as a question detector.
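
The whole mechanism can be simulated with a toy head. The numbers below are illustrative, not the post's weights: value scores of +1 on question words and -1 on pronouns, a constant positive query, and a key score that decreases with position, so earlier tokens win the softmax and set the sign of attn_out.

```python
import numpy as np

QUESTION_WORDS = {"why", "are", "do", "can"}
PRONOUNS = {"you", "he", "she", "they"}

def head_output(tokens, query=2.0):
    # Value node: positive on question words, negative on pronouns.
    values = np.array([1.0 if t in QUESTION_WORDS else
                      -1.0 if t in PRONOUNS else 0.0 for t in tokens])
    # Key node: roughly decreasing in token position.
    keys = -0.5 * np.arange(len(tokens))
    scores = query * keys
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over positions
    return attn @ values                           # the attn_out activation

print(head_output(["do", "you"]))   # positive: attends mostly to "do"
print(head_output(["you", "do"]))   # negative: attends mostly to "you"
```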

Appendix C. The role of layer norm.

Pruned models get very low task loss with very few nodes.
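
As a hedged sketch of what "pruned" means operationally (a toy two-layer network with made-up shapes, not the paper's code): zero out every node outside the kept set with a binary mask, then re-measure task loss with the mask applied.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 4)), rng.normal(size=4)

def task_loss(x, y, node_mask):
    h = np.maximum(x @ W1 + b1, 0.0) * node_mask           # zero pruned nodes
    logits = h @ W2 + b2
    m = logits.max(axis=-1, keepdims=True)                 # stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()              # mean cross-entropy

x, y = rng.normal(size=(16, 8)), rng.integers(0, 4, size=16)
keep = np.zeros(32); keep[:3] = 1.0                        # keep 3 of 32 nodes
print(task_loss(x, y, np.ones(32)))                        # full model
print(task_loss(x, y, keep))                               # pruned circuit
```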