Jacob Drori
I find that the extracted circuits can:

- Solve tasks using uniform attention patterns, even when the original model's attention pattern was importantly non-uniform.
- Repurpose nodes to perform different functions than they did in the original model.
- Behave very differently from the original model on inputs slightly outside the distribution used for pruning.
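To make the first point concrete, here is a minimal sketch of what a uniform (causal) attention pattern looks like: each query position attends equally to every position up to and including itself, discarding whatever pattern the model originally learned. This is an illustration, not the exact intervention used in the experiments.

```python
import numpy as np

def uniform_causal_attention(seq_len: int) -> np.ndarray:
    """Attention pattern where query position i attends with equal
    weight 1/(i+1) to positions 0..i, and weight 0 elsewhere."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask / mask.sum(axis=-1, keepdims=True)

# Row i has i+1 equal weights; e.g. row 2 of a length-4 pattern
# is [1/3, 1/3, 1/3, 0].
pattern = uniform_causal_attention(4)
```

A faithfulness check along these lines would substitute this pattern for a head's learned attention probabilities and measure whether task performance survives.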
Overall, these results suggest that circuits extracted from weight-sparse models, even when interpretable, should be scrutinized for faithfulness.
More generally, in interpretability research, we should not focus purely on pushing the Pareto frontier of circuit size versus task performance, since doing so may produce misleading explanations of model behavior.
In this post, I briefly review the tasks I designed to test the sparse-model methods, present a basic replication of the major results from Gao et al., and then give four lines of evidence suggesting that their pruning algorithm produces unfaithful circuits.
My code for training and analyzing weight-sparse models is here. It is similar to Gao et al.'s open-source code, but additionally implements the pruning algorithm, bridge training, multi-GPU support, and an interactive circuit viewer.
Training also runs roughly 3x faster in my tests.
Tasks
I extract weight-sparse circuits via pruning on the following three natural-language tasks.
For more details on training and pruning, see the appendix.
Task 1: Pronoun Matching
Prompts have the form "When [name] [action], [pronoun]". For example: "When Leo ran to the beach, he".
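A prompt of this form can be generated as follows. This is a hypothetical sketch: the name and action lists here are illustrative, not the ones actually used in the experiments.

```python
# Hypothetical name/action pools for the pronoun-matching task.
NAME_TO_PRONOUN = {"Leo": "he", "Mia": "she"}
ACTIONS = ["ran to the beach", "walked to school"]

def make_example(name: str, action: str) -> tuple[str, str]:
    """Return (prompt, target): the model should complete the prompt
    with the pronoun matching the name."""
    return f"When {name} {action},", NAME_TO_PRONOUN[name]

prompt, target = make_example("Leo", "ran to the beach")
# prompt: "When Leo ran to the beach,"   target: "he"
```

The model is scored on whether it completes the prompt with the correct pronoun token.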