Jacob Drori
The task: output the pronoun with the opposite gender to name_2.
Let $p_{\text{correct}}$ be the mean probability assigned to the correct target token, where the probability is computed by softmaxing over only the "him" and "her" logits.
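As a concrete sketch of this metric (the array shapes and the token-id arguments here are assumptions for illustration, not details from the original setup):

```python
import numpy as np

def p_correct(logits, him_id, her_id, targets):
    """Mean probability assigned to the correct pronoun, softmaxing over
    only the "him" and "her" logits; all other vocabulary entries are
    ignored.

    logits:  (batch, vocab) final-position logits
    targets: (batch,) correct token id per prompt (him_id or her_id)
    """
    two = logits[:, [him_id, her_id]]            # (batch, 2) restricted logits
    two = two - two.max(axis=1, keepdims=True)   # subtract max for stability
    probs = np.exp(two) / np.exp(two).sum(axis=1, keepdims=True)
    correct_col = (targets == her_id).astype(int)  # column 0 = him, 1 = her
    return probs[np.arange(len(targets)), correct_col].mean()
```

Restricting the softmax to two tokens means the metric only measures the him-vs-her decision, independent of how much mass the model puts on unrelated tokens.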
Here, I focus on the model with [formula omitted from the narration], which completes the task correctly 89% of the time on opposite-gender prompts and 81% of the time on same-gender prompts.
I run pruning 100 times with different random mask initializations and data ordering.
Below, I show the resulting distribution of $p_{\text{correct}}$ for opposite-gender prompts (left) and same-gender prompts (right).
I filter out runs that did not achieve CE loss below 0.15, leaving 77 seeds in total.
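In outline, the multi-seed protocol and the CE filter look like this (run_pruning is a hypothetical stand-in that simulates a run; only the 100 seeds and the 0.15 threshold come from the text):

```python
import random

def run_pruning(seed):
    """Hypothetical stand-in for one pruning run: in the real experiment,
    each seed gives a different random mask initialization and data
    ordering. Returns the final CE loss of the pruned circuit
    (simulated here with a placeholder value)."""
    rng = random.Random(seed)
    return rng.uniform(0.05, 0.30)

CE_THRESHOLD = 0.15  # runs with CE >= 0.15 are discarded

losses = {seed: run_pruning(seed) for seed in range(100)}
kept_seeds = [s for s, ce in losses.items() if ce < CE_THRESHOLD]
```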
[Figure: histograms of $p_{\text{correct}}$ over seeds, for opposite-gender prompts (left) and same-gender prompts (right).]
Often, pruning finds only the bad circuit: see the large spike at zero in the same-gender histogram.
This is bad, since the original model had high $p_{\text{correct}}$ in the same-gender case, and so must have been using the good circuit.
Separately, it is also a little worrying that pruning with the same hyperparameters but different random seeds can lead to circuits with totally different OOD behavior.
Conclusion
The above results provide evidence that Gao et al.'s pruning method can find circuits that are small, interpretable, and achieve good task loss, but are nevertheless unfaithful to what the model is really doing.
These results do not have much to say about whether weight sparse training itself is a promising direction.
They only show that the pruning algorithm is flawed.
My main takeaway is that we should not purely aim to improve the loss versus circuit size Pareto frontier.
Hill climbing on this metric alone is likely to yield mechanistic explanations that look appealing but are actually misleading.
For example, zero ablation improved the frontier, so I switched to it early on.
But in hindsight, mean ablation may have yielded more faithful circuits, at the cost of producing circuits with roughly 100 nodes for a task as simple as pronoun gender matching, which is far more than I had hoped for.
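For concreteness, the two ablation styles differ only in the fill value used for off-circuit nodes (a sketch; the (batch, n_nodes) activation layout and the mask convention are assumptions):

```python
import numpy as np

def ablate(acts, keep_mask, mode="mean"):
    """Ablate nodes outside the circuit.

    acts:      (batch, n_nodes) activations
    keep_mask: (n_nodes,) boolean, True for nodes kept in the circuit
    mode:      "zero" sets ablated nodes to 0; "mean" replaces them with
               their mean activation over the batch, preserving the
               typical contribution of off-circuit nodes.
    """
    out = acts.copy()
    if mode == "zero":
        out[:, ~keep_mask] = 0.0
    else:
        out[:, ~keep_mask] = acts[:, ~keep_mask].mean(axis=0)
    return out
```

Zero ablation deletes off-circuit nodes outright, while mean ablation keeps their average effect in place, which is why it tends to leave more of the model's actual computation intact (and hence demands larger circuits).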