Jacob Drori

I impose mild (25%) activation sparsity at each residual stream location, since this slightly improves loss. Gao et al. only did so at various other points, such as mlp_out.
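A minimal sketch of what activation sparsity at a single location could look like. The transcript does not say how the 25% figure is enforced, so this assumes it means keeping the 25% largest-magnitude entries of each activation vector and zeroing the rest. The function name abs_topk and the parameter keep_frac are hypothetical and used only for illustration.

```python
import numpy as np

def abs_topk(x: np.ndarray, keep_frac: float = 0.25) -> np.ndarray:
    """Zero all but the largest-magnitude entries along the last axis.

    Assumption: "25% activation sparsity" means 25% of entries stay non-zero.
    """
    d = x.shape[-1]
    k = max(1, int(round(keep_frac * d)))
    # After partitioning at position d - k, the last k positions hold the
    # indices of the k largest |x| values (in arbitrary order).
    idx = np.argpartition(np.abs(x), d - k, axis=-1)[..., d - k:]
    mask = np.zeros_like(x, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

# Example: a single 8-dimensional residual stream vector, so k = 2.
resid = np.array([[1.0, -5.0, 2.0, 0.5, -3.0, 0.1, 4.0, -0.2]])
sparse_resid = abs_topk(resid, keep_frac=0.25)
```

With keep_frac=0.25 and 8 dimensions, only the two largest-magnitude entries (-5.0 and 4.0) survive in this example.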

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

During the first half of training, Gao et al. linearly decay the fraction of non-zero parameters from 1 (fully dense) down to its target value.

I use an exponential decay schedule since this slightly improves loss.
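The two schedules can be compared with a short sketch. The linear version follows the description of decaying the non-zero fraction from 1 to the target over the first half of training. The exact form of the exponential schedule is not given here, so this assumes geometric interpolation between 1 and the target over the same first half. The function names and the step-based parameterisation are hypothetical.

```python
def linear_density(step: int, total_steps: int, target: float) -> float:
    """Fraction of non-zero parameters under a linear decay schedule."""
    decay_steps = total_steps // 2  # decay over the first half of training
    if step >= decay_steps:
        return target
    t = step / decay_steps  # progress through the decay phase, in [0, 1)
    return 1.0 + t * (target - 1.0)

def exponential_density(step: int, total_steps: int, target: float) -> float:
    """Fraction of non-zero parameters under an assumed exponential schedule.

    Assumption: geometric interpolation, i.e. density = target ** t, which
    equals 1 at t = 0 and reaches target at t = 1.
    """
    decay_steps = total_steps // 2
    if step >= decay_steps:
        return target
    t = step / decay_steps
    return target ** t

# Example: 1000 training steps, 1% of parameters non-zero at the end.
for step in (0, 250, 500):
    print(step, linear_density(step, 1000, 0.01), exponential_density(step, 1000, 0.01))
```

Under these assumptions the exponential schedule removes weights much faster early on: halfway through the decay phase it is already at roughly 10% density, while the linear schedule is still near 50%.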

That's the end of the list.

For pruning, I once again follow Gao et al. closely, with only small differences.

As mentioned in the main text, Gao et al. mask nodes by mean-ablating them, whereas I find that zero ablation yields smaller circuits.
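The difference between the two masking choices can be illustrated with a small sketch. It assumes a fixed binary mask over nodes and precomputed per-node mean activations; the actual pruning procedure, including how the mask is found, is not reproduced here. The function ablate and its arguments are hypothetical names.

```python
import numpy as np

def ablate(acts, keep_mask, mode="zero", mean_acts=None):
    """Replace the activations of masked-out nodes.

    acts:      array of shape (n_examples, n_nodes)
    keep_mask: boolean array of shape (n_nodes,), True for nodes in the circuit
    mode:      "zero" sets pruned nodes to 0, "mean" sets them to mean_acts
    """
    keep = np.asarray(keep_mask, dtype=bool)
    if mode == "zero":
        fill = np.zeros(acts.shape[-1], dtype=acts.dtype)
    elif mode == "mean":
        if mean_acts is None:
            raise ValueError("mean ablation needs mean_acts")
        fill = np.asarray(mean_acts, dtype=acts.dtype)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Broadcasting applies the same per-node mask and fill value to every example.
    return np.where(keep, acts, fill)

# Example: three nodes, the middle one is pruned.
acts = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
keep = np.array([True, False, True])
zero_ablated = ablate(acts, keep, mode="zero")
mean_ablated = ablate(acts, keep, mode="mean", mean_acts=acts.mean(axis=0))
```

In this toy example the pruned node is set to 0 under zero ablation and to its dataset mean (3.0) under mean ablation, while kept nodes pass through unchanged.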

The types of tasks I study involve natural language rather than code.

Heading.

Appendix B. Walkthrough of pronouns and question circuits.

Pronouns task, view circuit.

All the computation in this circuit routes through two value vector nodes in layer one.

The one shown below activates negatively on male names.
