Jacob Drori
For example, "Why do you want that key?" ends with a question mark, while "That is why I want the key." ends with a period. The task loss used for pruning is the binary cross-entropy, softmaxing only over the question-mark and period logits.
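To make the loss concrete, here is a minimal PyTorch sketch of how such a two-way cross-entropy could be computed. This is my reconstruction, not the post's code; `QMARK_ID`, `PERIOD_ID`, and `question_task_loss` are hypothetical names, and the actual token ids depend on the tokenizer.

```python
import torch
import torch.nn.functional as F

# Hypothetical token ids for "?" and "." (the real ids depend on the tokenizer).
QMARK_ID, PERIOD_ID = 30, 13

def question_task_loss(logits: torch.Tensor, is_question: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy, softmaxing over only the "?" and "." logits.

    logits:      [batch, vocab] logits at the final position
    is_question: [batch] bool tensor, True if the example ends with "?"
    """
    two_way = logits[:, [QMARK_ID, PERIOD_ID]]  # restrict the softmax to the two classes
    targets = (~is_question).long()             # index 0 = "?", index 1 = "."
    return F.cross_entropy(two_way, targets)
```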
## Results

See the appendix for a slightly tangential investigation into the role of layer norm when extracting sparse circuits.
### Producing sparse interpretable circuits

### Zero ablation yields smaller circuits than mean ablation
When pruning, Gao et al. set masked activations to their mean values over the pre-training set. I found that zero ablation usually leads to much smaller circuits at a given loss (in all subplots below except the third row, rightmost column).
Hence I used zero ablation for the rest of the project.
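For concreteness, here is a minimal sketch of the difference between the two ablation schemes, assuming per-node mean activations have already been computed over the pre-training set. The function and argument names are hypothetical, not taken from the post or from Gao et al.'s code.

```python
import torch

def ablate(acts: torch.Tensor, mask: torch.Tensor,
           mean_acts: torch.Tensor, mode: str = "zero") -> torch.Tensor:
    """Replace pruned-out activations with a baseline value.

    acts:      [batch, seq, d] activations at some layer
    mask:      broadcastable 0/1 tensor, 1 = node kept in the circuit
    mean_acts: [d] per-node means over the pre-training set
    """
    if mode == "zero":
        baseline = torch.zeros_like(acts)      # zero ablation
    elif mode == "mean":
        baseline = mean_acts.expand_as(acts)   # mean ablation, as in Gao et al.
    else:
        raise ValueError(f"unknown ablation mode: {mode}")
    return mask * acts + (1 - mask) * baseline
```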
[Figure: circuit size vs. task loss under zero and mean ablation, across tasks and models.]
### Weight-sparse models usually have smaller circuits
Figure 2 from Gao et al. mostly replicates.
On the pronoun and IOI tasks, the sparse models have smaller circuits than the dense model at a given loss.
On the question task, only two of the sparse models have smaller circuits than the dense one, and even then, the reduction in size is smaller than it was for the other two tasks.
[Figure: circuit size vs. task loss for sparse and dense models on the pronoun, IOI, and question tasks (cf. Figure 2 of Gao et al.).]
### Weight-sparse circuits look interpretable