Jacob Drori
Weight-sparse circuits may be interpretable yet unfaithful.
By Jacob Drori.
Published on February 9, 2026.
TL;DR
Recently, Gao et al. trained transformers with sparse weights and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks.
I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model's true computations.
This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.
Introduction

Recently, Gao et al. (2025) proposed an exciting approach to training models that are interpretable by design.
They train transformers where only a small fraction of their weights are non-zero and find that pruning these sparse models on narrow tasks yields interpretable circuits.
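To make the idea concrete, here is a minimal sketch of one common way to enforce weight sparsity: top-k magnitude masking, which zeroes all but the largest-magnitude entries of a weight matrix. This is an illustration of the general technique, not a reproduction of Gao et al.'s training procedure; the function name and the `frac_nonzero` parameter are my own.

```python
import numpy as np

def sparsify(weights, frac_nonzero=0.05):
    """Keep only the largest-magnitude weights; zero out the rest."""
    flat = np.abs(weights).ravel()
    k = max(1, int(frac_nonzero * flat.size))
    # Threshold at the k-th largest magnitude.
    threshold = np.partition(flat, -k)[-k]
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_sparse = sparsify(W, frac_nonzero=0.05)
print(np.mean(W_sparse != 0))  # fraction of non-zero weights, close to 0.05
```

In actual sparse training the mask would be applied (or re-derived) during optimization so the surviving weights can adapt, rather than pruning once after the fact.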
Their key claim is that these weight sparse models are more interpretable than ordinary dense ones, with smaller task-specific circuits.
Below, I reproduce the primary evidence for these claims.
Training weight sparse models does tend to produce smaller circuits at a given task loss than dense models, and the circuits also look interpretable.
However, there are reasons to worry that these results don't imply that we're capturing the model's full computation.
For example, previous work found that similar masking techniques can achieve good performance on vision tasks even when applied to a model with random weights.
Therefore, we might worry that the pruning method can find circuits that were not really present in the original model.
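A toy example of why masking alone is so expressive: given enough random weights, one can select a subset whose combination closely matches an arbitrary target, even though no training ever touched the weights themselves. The greedy subset selection below is my own illustrative construction, not the masking method used in the cited vision work.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=500)   # "random weights", never trained
target = 3.7

# Greedily include a weight whenever doing so moves the masked sum
# closer to the target.
mask = np.zeros_like(w, dtype=bool)
total = 0.0
for i in np.argsort(-np.abs(w)):  # consider large-magnitude weights first
    if abs(total + w[i] - target) < abs(total - target):
        mask[i] = True
        total += w[i]

print(total)  # close to 3.7, purely by choosing which weights to keep
```

The analogy is loose, but the mechanism is the same: a mask over many random parameters is itself a powerful optimization variable, so a pruning procedure can "discover" structure that the unmasked model never computed.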
I present evidence that the worry is justified.
Namely, pruned circuits can:
Achieve low cross-entropy loss on a nonsensical task.