Jacob Drori


Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I think a natural next step for the weight-sparsity line of work would be to:

1. Think of a good faithfulness metric; ideas like causal scrubbing seem on the right track, but possibly too strict.
2. Figure out how to modify the pruning algorithm to extract circuits that are faithful according to that metric.
3. Check whether Gao et al.'s main result, that weight-sparse models have smaller circuits, holds up when we use the modified pruning algorithm.
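One simple candidate for such a metric (looser than causal scrubbing) is the fraction of CE loss recovered by the pruned circuit relative to a fully-ablated baseline. Below is a minimal sketch of that idea; the function names and the logits-based interface are my own illustration, not anything from the post.

```python
import numpy as np

def cross_entropy(logits, targets):
    # mean negative log-likelihood of the target tokens under softmax(logits)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def loss_recovered(full_logits, circuit_logits, ablated_logits, targets):
    """Fraction of CE loss recovered by the pruned circuit, relative to
    a fully-ablated baseline: 1.0 means the circuit matches the full
    model's loss, 0.0 means it does no better than the baseline."""
    l_full = cross_entropy(full_logits, targets)
    l_circ = cross_entropy(circuit_logits, targets)
    l_abl = cross_entropy(ablated_logits, targets)
    return (l_abl - l_circ) / (l_abl - l_full)
```

A metric like this is exactly the kind of thing step 2 would then optimize against, which is also why it may be too weak on its own: a circuit can match the loss while computing the answer in a qualitatively different way.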

I would also be interested in applying similar scrutiny to the faithfulness of attribution graphs. I expect attribution graphs to be more faithful than the circuits I found in the present work, roughly speaking because the way they are pruned does not optimize directly for downstream CE loss, but someone should check this. I'd be particularly interested in looking for cases where attribution graphs make qualitatively wrong predictions about how the model will behave on unseen prompts, similar to how pruning found the bad circuit for the IOI task above.
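One cheap way to surface such qualitative disagreements is to compare the top-1 predictions of the full model and the pruned circuit on held-out prompts. A minimal sketch, assuming both produce next-token logits (the function name and interface are hypothetical):

```python
import numpy as np

def top1_agreement(full_logits, circuit_logits):
    """Fraction of positions where the pruned circuit predicts the same
    next token as the full model. Low agreement on unseen prompts is
    evidence of unfaithfulness, even when the circuit's loss looks fine."""
    full_preds = full_logits.argmax(axis=-1)
    circuit_preds = circuit_logits.argmax(axis=-1)
    return float((full_preds == circuit_preds).mean())
```

The interesting cases are then the individual disagreements, which can be inspected by hand the way the bad IOI circuit was.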

Appendix A. Training and pruning details.

My implementation of weight-sparse training is almost exactly copied from Gao et al., so here I just mention a few differences and points of interest. I train two-layer models with various sizes and sparsities.

[Table with columns d_model, frac_nonzero, and N_nonzero; see the original post for its contents.]
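For orientation, the core of weight-sparse training is to project the weights back onto the sparsity constraint after each optimizer step, keeping only the largest-magnitude entries. A minimal sketch of such a projection, assuming the table's frac_nonzero hyperparameter; this is an illustration of the general technique, not the post's exact implementation:

```python
import numpy as np

def project_topk(weights, frac_nonzero):
    """Keep only the largest-magnitude `frac_nonzero` fraction of the
    entries of a weight matrix, zeroing the rest. Applying this after
    every optimizer step enforces the sparsity constraint."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(frac_nonzero * flat.size)))
    # threshold at the k-th largest absolute value
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(weights) >= thresh, weights, 0.0)
```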

[List of bullet points; see the original post.]

Unfortunately, I did not ensure that each model has the same number of non-zero parameters.


However, the only time I compare different models below is when comparing their circuit size versus task loss Pareto curves, and this is just a replication of the main Gao et al.