Jacob Drori
I think a natural next step for the weight sparsity line of work would be to:

1. Think of a good faithfulness metric. Ideas like causal scrubbing seem on the right track, but possibly too strict.
2. Figure out how to modify the pruning algorithm to extract circuits that are faithful according to that metric.
3. Check whether Gao et al.'s main result, that weight-sparse models have smaller circuits, holds up when we use the modified pruning algorithm.
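As a starting point for step 1, one simple (and probably too lenient) faithfulness proxy is the KL divergence between the full model's and the circuit's next-token distributions on task prompts. This sketch is my own illustration, not a metric from the original post; the function name and batch setup are hypothetical:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def faithfulness_kl(full_logits, circuit_logits):
    """Mean KL(full || circuit) over a batch of prompts.

    Lower is better: a faithful circuit should reproduce the full
    model's output distribution, not just match its CE loss.
    """
    p = softmax(full_logits)
    q = softmax(circuit_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

A metric like this only looks at outputs, so it would not catch a circuit that gets the right answer via the wrong internal mechanism; causal-scrubbing-style interventions are stricter precisely because they also constrain the intermediate computation.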
I would also be interested in applying similar scrutiny to the faithfulness of attribution graphs.
I expect attribution graphs to be more faithful than the circuits I found in the present work, roughly speaking, because the way they are pruned does not optimize directly for downstream CE loss, but someone should check this.
I'd be particularly interested in looking for cases where attribution graphs make qualitatively wrong predictions about how the model will behave on unseen prompts, similar to how pruning found the bad circuit for the IOI task above.
Appendix A: Training and pruning details
My implementation of weight-sparse training is almost exactly copied from Gao et al., so here I just mention a few differences and points of interest. I train two-layer models with various sizes and sparsities.
(Table: one row per model, with columns d_model, frac_nonzero, and n_nonzero; see the original post for the values.)
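For concreteness, the frac_nonzero constraint can be sketched as magnitude-based masking: keep only the largest-magnitude entries of each weight matrix and zero the rest. This is my reconstruction of the general technique, not necessarily Gao et al.'s exact implementation, and the function name is my own:

```python
import numpy as np

def sparsify(weights, frac_nonzero):
    """Zero all but the largest-magnitude entries of a weight matrix,
    so that roughly `frac_nonzero` of its entries remain nonzero.
    In training, a mask like this would be re-applied after each
    optimizer step to maintain the sparsity constraint.
    """
    k = max(1, int(round(frac_nonzero * weights.size)))
    # Threshold at the k-th largest absolute value; exact ties at the
    # threshold may keep slightly more than k entries.
    thresh = np.partition(np.abs(weights).ravel(), -k)[-k]
    mask = np.abs(weights) >= thresh
    return weights * mask
```

Under this scheme, n_nonzero for a given matrix is just frac_nonzero times its entry count, which is why the two columns in the table are redundant given d_model.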
- Unfortunately, I did not ensure that each model has the same number of non-zero parameters. However, the only time I compare different models below is when comparing their circuit size versus task loss Pareto curves, and this is just a replication of the main Gao et al.