Jacob Drori
I find that the extracted circuits can:

- Solve tasks using uniform attention patterns, even when the original model's attention pattern was importantly non-uniform.
- Repurpose nodes to perform different functions than they did in the original model.
- Behave very differently from the original model on inputs slightly outside the distribution used for pruning.
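To make the first point concrete, here is a minimal sketch of what a uniform (causal) attention pattern looks like: each query position attends equally to every position up to and including itself, discarding whatever pattern the model originally learned. This is an illustration, not the exact intervention used in the experiments.

```python
import numpy as np

def uniform_causal_attention(seq_len: int) -> np.ndarray:
    """Attention pattern where query position i attends with equal
    weight 1/(i+1) to positions 0..i, and weight 0 elsewhere."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask / mask.sum(axis=-1, keepdims=True)

# Row i has i+1 equal weights; e.g. row 2 of a length-4 pattern
# is [1/3, 1/3, 1/3, 0].
pattern = uniform_causal_attention(4)
```

A faithfulness check along these lines would substitute this pattern for a head's learned attention probabilities and measure whether task performance survives.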
Overall, these results suggest that circuits extracted from weight-sparse models, even when interpretable, should be scrutinized for faithfulness.
More generally, in interpretability research, we should not focus purely on pushing the Pareto frontier of circuit size versus task performance, since doing so may produce misleading explanations of model behavior.
In this post, I briefly review the tasks I designed to test the sparse-model methods, present a basic replication of the major results from Gao et al., and then give four lines of evidence suggesting that their pruning algorithm produces unfaithful circuits.
My code for training and analyzing weight-sparse models is here. It is similar to Gao et al.'s open-source code, but additionally implements the pruning algorithm, bridge training, multi-GPU support, and an interactive circuit viewer.
Training also runs roughly 3x faster in my tests.
Tasks
I extract weight-sparse circuits via pruning on the following three natural-language tasks.
For more details on training and pruning, see the appendix.
Task 1: Pronoun Matching
Prompts have the form "When [name] [action], [pronoun]". For example: "When Leo ran to the beach, he".
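A prompt of this form can be generated as follows. This is a hypothetical sketch: the name and action lists here are illustrative, not the ones actually used in the experiments.

```python
# Hypothetical name/action pools for the pronoun-matching task.
NAME_TO_PRONOUN = {"Leo": "he", "Mia": "she"}
ACTIONS = ["ran to the beach", "walked to school"]

def make_example(name: str, action: str) -> tuple[str, str]:
    """Return (prompt, target): the model should complete the prompt
    with the pronoun matching the name."""
    return f"When {name} {action},", NAME_TO_PRONOUN[name]

prompt, target = make_example("Leo", "ran to the beach")
# prompt: "When Leo ran to the beach,"   target: "he"
```

The model is scored on whether it completes the prompt with the correct pronoun token.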