Jacob Drori
result, not an important part of this post.
Each model is trained on 2B tokens from SimpleStories.
These weight-sparse models were trained alongside bridges mapping from them to a dense model.
This was a bad choice on my part, since none of my results ended up using the bridges at all, and they add complexity.
That said, I expect the results to be essentially the same for standalone weight-sparse models.
I impose mild (25%) activation sparsity at each residual-stream location, whereas Gao et al. only did so at certain other points such as mlp_out; this slightly improves loss.
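One common way to impose a fixed fraction of activation sparsity is to keep only the largest-magnitude entries at each position and zero the rest. The sketch below is my own illustration of that idea, not the post's actual code; the function name and API are assumptions.

```python
import numpy as np

def topk_activation_sparsity(x: np.ndarray, keep_frac: float = 0.25) -> np.ndarray:
    """Zero all but the largest-magnitude `keep_frac` of entries along the
    last (feature) axis. Hypothetical sketch of per-location activation
    sparsity; in training, the same mask would be applied on the forward pass."""
    k = max(1, int(keep_frac * x.shape[-1]))
    # Threshold = k-th largest |value| at each position; keep entries at or above it.
    thresh = np.sort(np.abs(x), axis=-1)[..., -k]
    return np.where(np.abs(x) >= thresh[..., None], x, 0.0)
```

With `keep_frac = 0.25`, a position with 4 features keeps only its single largest-magnitude activation.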
During the first half of training, Gao et al. linearly decay the fraction of non-zero parameters from 1 (fully dense) down to its target value. I instead use an exponential decay schedule, since this slightly improves loss.
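The two schedules can be sketched as follows. This is a minimal illustration under my own assumptions about the exact shapes (linear interpolation vs. log-space interpolation over the first half of training); the function and its parameters are hypothetical.

```python
import math

def density_schedule(step: int, total_steps: int, target: float,
                     mode: str = "exp") -> float:
    """Fraction of non-zero parameters at `step`: decays from 1.0 (fully
    dense) to `target` over the first half of training, then holds."""
    decay_steps = total_steps // 2
    if step >= decay_steps:
        return target
    t = step / decay_steps  # progress through the decay phase, in [0, 1)
    if mode == "linear":
        return 1.0 + t * (target - 1.0)
    # Exponential: interpolate in log-space from 1.0 down to target.
    return math.exp(t * math.log(target))
```

Midway through the decay phase, the linear schedule is still roughly half dense, while the exponential schedule has already dropped to the geometric mean of 1 and the target density.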
For pruning, I once again follow Gao et al. closely, with only small differences.
As mentioned in the main text, Gao et al. mask nodes by mean-ablating them, whereas I find that zero ablation yields smaller circuits.
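The difference between the two ablation styles is small in code. Below is a hedged sketch of what masking a node means in each case; the helper and its signature are my own, not the post's implementation.

```python
import numpy as np

def ablate(acts: np.ndarray, mask: np.ndarray, mode: str = "zero") -> np.ndarray:
    """Ablate pruned nodes. `acts` is (batch, nodes); mask[j] = False means
    node j is outside the circuit. Zero ablation replaces a pruned node's
    activation with 0; mean ablation replaces it with that node's mean
    activation over the batch."""
    baseline = np.zeros(acts.shape[-1]) if mode == "zero" else acts.mean(axis=0)
    return np.where(mask, acts, baseline)
```

Zero ablation is the stricter test: a node survives pruning only if the circuit works with its contribution removed entirely, not merely replaced by a typical value, which plausibly explains the smaller circuits.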
The types of tasks I study involve natural language rather than code.
Appendix B: Walkthrough of pronouns and question circuits
Pronouns task (view circuit).
All the computation in this circuit routes through two value-vector nodes in layer 1.
The one shown below activates negatively on male names.