Jacob Drori
There are no query or key nodes, so attention patterns are uniform, and each token after the name gets the same contribution from this value vector node.
Tracing forward to the logit nodes, one finds that this value vector node boosts "she" and suppresses "he".
The other value vector node does the same but with the genders reversed.
There's an image here.
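The uniform-attention claim above can be sketched numerically. This is a hedged toy example (shapes and values are made up, not taken from the article's model): with no query or key nodes, the pre-softmax scores are constant, so a causal softmax produces uniform attention over all preceding positions, and the head's output at each position is just the running mean of the value vectors.

```python
import numpy as np

# Toy sketch: constant attention scores under a causal mask.
T, d = 5, 4                        # sequence length, value dimension (assumed)
rng = np.random.default_rng(0)
values = rng.normal(size=(T, d))   # per-token value vectors

scores = np.zeros((T, T))                    # no query/key nodes: constant scores
mask = np.tril(np.ones((T, T), dtype=bool))  # causal mask
scores[~mask] = -np.inf
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

# Each row t attends uniformly to positions 0..t, so the output at
# position t is the mean of the value vectors seen so far.
out = attn @ values
expected = np.stack([values[: t + 1].mean(0) for t in range(T)])
assert np.allclose(out, expected)
```

In particular, if a value node fires only on the name token, every later position receives that same vector, scaled only by the uniform attention weight.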
Notice that the MLP_out nodes in layer 1 have no incoming weights connecting to upstream circuit nodes, so their activations are constant biases.
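A minimal illustration of the constant-bias point, with assumed shapes and values (not taken from the model): if pruning removes every incoming weight of an MLP_out node, its pre-activation reduces to the bias term, so its output is the same constant for every input.

```python
import numpy as np

d_in, d_hidden = 8, 3
W = np.zeros((d_hidden, d_in))      # all incoming weights pruned away
b = np.array([0.5, -1.2, 2.0])      # remaining bias terms (made up)

def mlp_out(x):
    # ReLU(W @ x + b) collapses to ReLU(b) when W is zero.
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(1)
a = mlp_out(rng.normal(size=d_in))
c = mlp_out(rng.normal(size=d_in))
assert np.allclose(a, c)            # output is input-independent
```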
Questions task (view circuit).
The earliest node that acts as a question classifier is at attention_out in layer 1.
There's an image here.
This attention_out node reads from a value node that activates positively on question words ("why", "are", "do", "can") and negatively on pronouns.
There's an image here.
The query node for the head has positive activations (not shown).
The key node's activations roughly decrease as a function of token position.
There's an image here.
Thus, if a prompt contains "do you", the head attends more strongly to "do", so attention_out receives a large positive contribution from the question word "do" and only a small negative contribution from the pronoun "you".
Conversely, if the prompt contains "you do", the head attends more strongly to "you", so attention_out receives a large negative contribution from "you" and only a small positive contribution from "do".
Putting this all together, the attention_out node activates positively on prompts containing "do you" and negatively on prompts containing "you do", and similarly for other combinations of question words and pronouns.
Hence the attention_out node functions as a question detector.
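The word-order mechanism described above can be sketched in a few lines. This is a hedged toy model, not the article's circuit: all numbers are invented, with a key activation that decays with position (so the head attends more to the earlier token) and a value node that is positive on question words and negative on pronouns.

```python
import numpy as np

def head_output(order):
    """Toy attention_out contribution for a two-token span, e.g.
    ("do", "you") vs ("you", "do")."""
    value = {"do": +1.0, "you": -1.0}   # + on question words, - on pronouns
    keys = np.array([1.0, 0.2])         # decays with token position (assumed)
    query = 1.0                         # positive query activation
    attn = np.exp(query * keys)
    attn /= attn.sum()                  # softmax over the two tokens
    return sum(a * value[tok] for a, tok in zip(attn, order))

assert head_output(("do", "you")) > 0   # question-like order: positive
assert head_output(("you", "do")) < 0   # statement-like order: negative
```

Because the earlier token always gets the larger attention weight, swapping the order of the same two tokens flips the sign of the contribution, which is exactly the question-versus-statement distinction.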
Appendix C: The role of layer norm
Pruned models get very low task loss with very few nodes.