Hence the head attends most strongly to the first part of the prompt, so the attn_out node only gets a large contribution from the value-vector node when that node appears near the start of the sentence. Specifically, the attn_out node gets a large negative contribution when name_1 is male. The other value-vector node (not shown here) gives a positive contribution when name_1 is female. This explains the activation patterns we saw above for the attn_out node.
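To make the mechanism concrete, here is a minimal sketch of how a head's attn_out decomposes into attention-weighted value-vector contributions. The function name, shapes, and NumPy framing are illustrative assumptions, not the post's actual code.

```python
import numpy as np

# Minimal sketch (names/shapes are assumptions): a head's output at the final
# position is a sum of value vectors weighted by the attention pattern.
# attn[j] = attention weight from the final position to source position j;
# v[j]    = position j's value vector, taken post-W_O so it lives in the
#           residual stream.
def attn_out_contributions(attn: np.ndarray, v: np.ndarray) -> np.ndarray:
    """attn: (seq,), v: (seq, d_model) -> per-position contributions (seq, d_model)."""
    return attn[:, None] * v

# attn_out itself is the sum over source positions:
#     attn_out = attn_out_contributions(attn, v).sum(axis=0)
# so a value-vector node contributes strongly only when its position (here,
# name_1 near the start of the prompt) receives high attention.
```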
A nonsense task
For example:

"When Rita went to the woods, wen"
"When Leah went to the woods, dis"
Just like the standard pronouns task, the task loss is the standard CE loss; that is, all logits are softmaxed. (I am not using a binary CE over just the two candidate tokens.)
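For concreteness, here is a minimal sketch of that distinction; the tensor names and token ids are placeholders, not the post's code.

```python
import torch
import torch.nn.functional as F

# The task loss as described: cross-entropy with a softmax over the FULL
# vocabulary. logits: (batch, vocab); targets: correct token ids, (batch,).
def task_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits, targets)  # softmax over all logits

# By contrast, a binary CE would normalise over only the two candidate
# tokens (tok_a, tok_b are hypothetical token ids):
def binary_ce(logits: torch.Tensor, tok_a: int, tok_b: int,
              target_is_a: torch.Tensor) -> torch.Tensor:
    pair = logits[:, [tok_a, tok_b]]  # (batch, 2)
    labels = (~target_is_a).long()    # 0 -> tok_a, 1 -> tok_b
    return F.cross_entropy(pair, labels)
```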
This is a nonsense task, but the pruned model gets task loss less than 0.05, which corresponds to putting about 95% probability on the correct token, with only roughly 30 nodes.
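To spell out the conversion: the CE loss is the negative log-probability of the correct token, so

```latex
\mathcal{L}_{\mathrm{CE}} = -\log p_{\mathrm{correct}} < 0.05
\quad\Longrightarrow\quad
p_{\mathrm{correct}} > e^{-0.05} \approx 0.951 .
```

(Strictly, a bound on the average loss bounds the geometric mean of the correct-token probability across prompts.)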
[Image]
Around 10 nodes are needed to achieve a similar loss on the ordinary pronouns task.
So the nonsense task does require a larger circuit than the real task, which is somewhat reassuring.
That said, it seems worrying that any circuit at all is able to get such low loss on the nonsense task, and 30 nodes is really not many.
You can view the nonsense circuit here.
Important attention patterns can be absent in the pruned model
This pronoun circuit has attention nodes only in layer 1 head 7.
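As a concrete way to inspect that head, here is a sketch assuming TransformerLens-style tooling; the library, model choice, and prompt are assumptions rather than the post's actual setup.

```python
from transformer_lens import HookedTransformer

# Sketch (assumptions: TransformerLens, GPT-2 small): look at the attention
# pattern of layer 1, head 7 on a pronouns-style prompt.
model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Rita went to the woods,"
_, cache = model.run_with_cache(prompt)
pattern = cache["pattern", 1][0, 7]  # (query_pos, key_pos) for head 7
print(model.to_str_tokens(prompt))
print(pattern.round(decimals=2))
```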