Jacob Drori
In particular, the final token's activation is positive only if name_1 is female and, as expected, the node directly suppresses the " him" logit.
So, in the pruned model, the node is playing the functional role "detect the gender of name_1 and boost/suppress the corresponding logit."
But in the original model, the final token's activation does not depend on name_1, so it cannot be playing the same functional role.
[Image]
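The diagnostic above can be sketched as a simple swap test: a node can only be "detecting the gender of name_1" if changing name_1's gender changes its activation. The sketch below uses a hypothetical `node_activation` readout and toy stand-in models; it is illustrative, not the original experiment.

```python
# Sketch of the swap test described above. `node_activation` is a hypothetical
# stand-in for running a forward pass and reading one node's final-token
# activation; the toy "models" below are placeholders, not real networks.
def node_activation(model, prompt: str) -> float:
    return model(prompt)  # placeholder for a real forward pass + readout

def depends_on_name_1_gender(model, tol: float = 1e-3) -> bool:
    # Compare activations on a prompt pair differing only in name_1's gender.
    a_female = node_activation(model, "When Mary was at the store, John urged")
    a_male = node_activation(model, "When Tom was at the store, John urged")
    return abs(a_female - a_male) > tol

# Toy stand-ins: the "pruned" node fires only on a female name_1,
# while the "original" node ignores name_1 entirely.
pruned = lambda p: 1.0 if "Mary" in p else 0.0
original = lambda p: 0.0

print(depends_on_name_1_gender(pruned))    # pruned node acts like a gender detector
print(depends_on_name_1_gender(original))  # original node does not depend on name_1
```

In the real setting the swap would be run over many name pairs, but one matched pair already shows the logic of the argument.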
Example 3.
Below are activations of mlp_out, layer 1, node 1455 from this question circuit.
In the pruned model, the node is a questions classifier.
Its activations are negative on questions and roughly zero elsewhere.
It is used to suppress the question mark logit.
But in the original model, it is not a question classifier.
In particular, its activation on the last token of a sentence does not predict whether or not the sentence was a question, so it cannot be helping to promote the correct logit.
[Image]
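The check in Example 3 can be made concrete: a node counts as a "questions classifier" only if its last-token activations separate questions from non-questions (here, negative on questions, roughly zero elsewhere). The activation values below are illustrative numbers chosen to mimic the described behavior, not real model outputs.

```python
# Sketch of the classifier check described above. Each example is a pair
# (last_token_activation, is_question); thresholds are illustrative.
def is_question_classifier(examples) -> bool:
    questions = [a for a, is_q in examples if is_q]
    others = [a for a, is_q in examples if not is_q]
    # Pruned-model pattern: clearly negative on questions, near zero elsewhere.
    return all(a < -0.5 for a in questions) and all(abs(a) < 0.1 for a in others)

# Toy activations mimicking the pruned vs. original behavior described above.
pruned_acts = [(-1.2, True), (-0.9, True), (0.02, False), (-0.01, False)]
original_acts = [(0.3, True), (-0.2, True), (0.4, False), (-0.3, False)]

print(is_question_classifier(pruned_acts))    # separates questions from non-questions
print(is_question_classifier(original_acts))  # activation does not predict question-ness
```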
Pruned circuits may not generalize like the base model
Recall that IOI prompts look like: "When name_1 was at the store, name_2 urged __".
We prune using a train set consisting only of prompts where name_1 and name_2 have opposite genders.
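A train set of this shape can be sketched as follows. The name lists and the helper are illustrative (they are not from the original setup); the only constraint carried over is that name_1 and name_2 always have opposite genders.

```python
import random

# Illustrative name lists; any gendered name lists would do.
FEMALE = ["Mary", "Alice", "Sarah"]
MALE = ["John", "Bob", "Tom"]

def make_train_prompt(rng: random.Random) -> dict:
    # Randomly decide which slot gets the female name, so name_1's gender
    # (and hence the correct pronoun) varies across the train set, while
    # name_1 and name_2 always have opposite genders.
    female_first = rng.random() < 0.5
    name_1 = rng.choice(FEMALE if female_first else MALE)
    name_2 = rng.choice(MALE if female_first else FEMALE)
    return {
        "prompt": f"When {name_1} was at the store, {name_2} urged",
        "answer": " her" if female_first else " him",
    }

rng = random.Random(0)  # fixed seed for reproducibility
train_set = [make_train_prompt(rng) for _ in range(4)]
for ex in train_set:
    print(ex["prompt"], "->", ex["answer"])
```

Because every training prompt has opposite-gender names, the "good" and "bad" circuits below are indistinguishable on this set, which is exactly what makes the generalization question interesting.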
There are two obvious circuits that get good performance on the train set.
Good circuit: Output the pronoun with the gender of name_1.
Bad circuit: