Jacob Drori


Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I think a natural next step for the weight-sparsity line of work would be to:

1. Think of a good faithfulness metric; ideas like causal scrubbing seem on the right track, but possibly too strict.
2. Figure out how to modify the pruning algorithm to extract circuits that are faithful according to that metric.
3. Check whether Gao et al.'s main result, that weight-sparse models have smaller circuits, holds up when we use the modified pruning algorithm.
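One simple candidate for such a metric (looser than causal scrubbing) is the fraction of CE loss recovered by the pruned circuit relative to a fully-ablated baseline. Below is a minimal sketch of that idea; the function names and the logits-based interface are my own illustration, not anything from the post.

```python
import numpy as np

def cross_entropy(logits, targets):
    # mean negative log-likelihood of the target tokens under softmax(logits)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def loss_recovered(full_logits, circuit_logits, ablated_logits, targets):
    """Fraction of CE loss recovered by the pruned circuit, relative to
    a fully-ablated baseline: 1.0 means the circuit matches the full
    model's loss, 0.0 means it does no better than the baseline."""
    l_full = cross_entropy(full_logits, targets)
    l_circ = cross_entropy(circuit_logits, targets)
    l_abl = cross_entropy(ablated_logits, targets)
    return (l_abl - l_circ) / (l_abl - l_full)
```

A metric like this is exactly the kind of thing step 2 would then optimize against, which is also why it may be too weak on its own: a circuit can match the loss while computing the answer in a qualitatively different way.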

I would also be interested in applying similar scrutiny to the faithfulness of attribution graphs. I expect attribution graphs to be more faithful than the circuits I found in the present work, roughly speaking because the way they are pruned does not optimize directly for downstream CE loss, but someone should check this. I'd be particularly interested in looking for cases where attribution graphs make qualitatively wrong predictions about how the model will behave on unseen prompts, similar to how pruning found the bad circuit for the IOI task above.
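One cheap way to surface such qualitative disagreements is to compare the top-1 predictions of the full model and the pruned circuit on held-out prompts. A minimal sketch, assuming both produce next-token logits (the function name and interface are hypothetical):

```python
import numpy as np

def top1_agreement(full_logits, circuit_logits):
    """Fraction of positions where the pruned circuit predicts the same
    next token as the full model. Low agreement on unseen prompts is
    evidence of unfaithfulness, even when the circuit's loss looks fine."""
    full_preds = full_logits.argmax(axis=-1)
    circuit_preds = circuit_logits.argmax(axis=-1)
    return float((full_preds == circuit_preds).mean())
```

The interesting cases are then the individual disagreements, which can be inspected by hand the way the bad IOI circuit was.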

Appendix A. Training and pruning details.

My implementation of weight-sparse training is almost exactly copied from Gao et al., so here I just mention a few differences and points of interest. I train two-layer models with various sizes and sparsities.

[Table with columns d_model, frac_nonzero, and N_nonzero; see the original post for its contents.]
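For orientation, the core of weight-sparse training is to project the weights back onto the sparsity constraint after each optimizer step, keeping only the largest-magnitude entries. A minimal sketch of such a projection, assuming the table's frac_nonzero hyperparameter; this is an illustration of the general technique, not the post's exact implementation:

```python
import numpy as np

def project_topk(weights, frac_nonzero):
    """Keep only the largest-magnitude `frac_nonzero` fraction of the
    entries of a weight matrix, zeroing the rest. Applying this after
    every optimizer step enforces the sparsity constraint."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(frac_nonzero * flat.size)))
    # threshold at the k-th largest absolute value
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(weights) >= thresh, weights, 0.0)
```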

[List of bullet points; see the original post.]

Unfortunately, I did not ensure that each model has the same number of non-zero parameters.


However, the only time I compare different models below is when comparing their circuit size versus task loss Pareto curves, and this is just a replication of the main Gao et al.