Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Jacob Drori

๐Ÿ‘ค Speaker
274 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Solve tasks using uniform attention patterns even when the original model's attention pattern was importantly non-uniform.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Repurpose nodes to perform different functions than they did in the original model.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Behave very differently to the model on inputs that are slightly out of the distribution used for pruning.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Overall, these results suggest that circuits extracted from weight-sparse models, even when interpretable, should be scrutinized for faithfulness.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

More generally, in interpretability research, we should not purely try to push the Pareto frontier of circuit size and task performance, since doing so may produce misleading explanations of model behavior.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

In this post, I briefly review the tasks I designed to test the sparse model methods, present a basic replication of the major results from Giao et al., and then give 4.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Lines of evidence suggesting that their pruning algorithm produce unfaithful circuits.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

My code for training and analyzing weight sparse models is here.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

It is similar to Gio et al's open source code, but it additionally implements the pruning algorithm, bridges training, multi-GPU support, and an interactive circuit viewer.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Training also runs roughly 3x faster in my tests.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Heading.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Tasks.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I extract weight sparse circuits via pruning on the following three natural language tasks.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

For more details on training and pruning, see the appendix.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Subheading.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Task 1.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Pronoun Matching.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Prompts have the form when, name, action, pronoun.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

For example.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

When Leo ran to the beach, he.