Jacob Drori
In the original model, this head attends strongly from the last token to the name token, "Rita", as one would expect.
But in the pruned model, its attention pattern is uniform, since the circuit contains no query or key vector nodes.
[Figure: the head's attention patterns in the original and pruned models.]
How does the pruned circuit get away with not bothering to compute an attention pattern?
It does so by having value vector nodes that all fire strongly on name tokens and very weakly everywhere else.
So even though the head attends to all tokens, it only moves information from the name token.
Such a mechanism was not available to the original model.
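This trick can be sketched numerically. In the toy example below (illustrative numbers, not taken from the actual model), uniform attention over value vectors that are near zero everywhere except the name token yields an output dominated by the name token's value, even though no attention pattern was computed:

```python
import numpy as np

# Hypothetical toy sequence where index 1 is the name token.
# Value vectors fire strongly on the name and weakly elsewhere.
values = np.array([
    [0.01, -0.02],   # "When"
    [1.50,  2.10],   # "Rita"  <- large value vector
    [0.03,  0.00],   # "went"
    [-0.02, 0.01],   # "to"
])

seq_len = values.shape[0]

# Pruned model: no query/key nodes, so attention is uniform.
uniform_attn = np.full(seq_len, 1.0 / seq_len)
pruned_out = uniform_attn @ values

# The output is approximately values[1] / seq_len: information is
# effectively moved only from the name token, despite uniform attention.
print(pruned_out)
print(values[1] / seq_len)
```

This is why the pruned circuit can discard the query and key nodes: selectivity has been pushed entirely into the value vectors.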
The circuit we found misses a crucial part of what the original model was doing.
Nodes can play different roles in the pruned model.
Example 1.
Below are the activations of layer 0, node 1651 from this IOI circuit.
The left figure shows its activations in the pruned model, where it activates negatively (red) on female names.
The right figure shows its activations in the original model, where it activates positively (blue) on male names.
In both cases, its activation is very close to 0 for all non-name tokens.
This is strange.
The node acquires a different meaning after pruning.
[Figure: activations of layer 0, node 1651 in the pruned model (left) and the original model (right).]
Below are the activations of attention_out, layer 1, node 244 from this IOI circuit.
In the pruned model, the node activates positively on contexts where name_1, the first appearing name, is female, and negatively on ones where it is male.