Jacob Drori

Speaker
274 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Output the pronoun with the opposite gender to name_2.

Let [complex formula omitted from the narration] be the mean probability assigned to the correct target token, where we compute the probability by softmaxing over only the "him" and "her" tokens.
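
As a rough illustration of the metric just described, here is a minimal Python sketch that softmaxes over only the two candidate logits and averages the probability of the correct token. The function names and the example logits are hypothetical and are not taken from the post; the post's own formula was omitted from the narration and is not reconstructed here.

```python
import math

def two_token_prob(logit_him: float, logit_her: float, target: str) -> float:
    """Probability of `target` ("him" or "her") when softmaxing over only these two logits."""
    m = max(logit_him, logit_her)  # subtract the max for numerical stability
    e_him = math.exp(logit_him - m)
    e_her = math.exp(logit_her - m)
    p_him = e_him / (e_him + e_her)
    return p_him if target == "him" else 1.0 - p_him

def mean_correct_prob(examples) -> float:
    """Average the restricted probability over (logit_him, logit_her, target) triples."""
    probs = [two_token_prob(lh, lr, t) for lh, lr, t in examples]
    return sum(probs) / len(probs)

# Hypothetical logits for three prompts (illustrative only, not data from the post).
examples = [(2.0, 0.0, "him"), (0.0, 2.0, "her"), (1.0, 1.0, "him")]
score = mean_correct_prob(examples)
```

Restricting the softmax to two tokens means the score depends only on the difference between the two logits, so probability mass placed on unrelated tokens does not affect it.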

Here, I focus on the model with [complex formula omitted from the narration], which completes the task correctly 89% of the time for opposite gender prompts and 81% of the time for same gender prompts.

I run pruning 100 times with different random mask initializations and data ordering.

Below I show the resulting distribution of [complex formula omitted from the narration] for opposite gender prompts (left) and same gender prompts (right).

I filter out runs that did not achieve CE below 0.15, leaving 77 seeds in total.
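
The filtering step can be sketched roughly as follows. This is only an illustration: the `PruningRun` record, its field names, the example values, and the `cutoff` used to count runs near zero are all assumptions made for the sketch, and CE is assumed here to mean the task cross-entropy of the pruned circuit. Only the 0.15 threshold comes from the narration.

```python
from dataclasses import dataclass

CE_THRESHOLD = 0.15  # threshold quoted in the narration

@dataclass
class PruningRun:
    seed: int
    ce: float             # assumed: task cross-entropy of the pruned circuit
    p_same_gender: float  # assumed: the narrated metric on same gender prompts

def keep_converged(runs, threshold=CE_THRESHOLD):
    """Keep only runs whose CE is strictly below the threshold."""
    return [run for run in runs if run.ce < threshold]

def fraction_near_zero(runs, cutoff=0.05):
    """Fraction of runs whose same gender metric falls below a hypothetical cutoff."""
    if not runs:
        return 0.0
    return sum(run.p_same_gender < cutoff for run in runs) / len(runs)

# Made-up runs for illustration; these are not results from the post.
runs = [
    PruningRun(seed=0, ce=0.08, p_same_gender=0.01),
    PruningRun(seed=1, ce=0.12, p_same_gender=0.85),
    PruningRun(seed=2, ce=0.30, p_same_gender=0.40),
    PruningRun(seed=3, ce=0.14, p_same_gender=0.02),
]
kept = keep_converged(runs)
near_zero = fraction_near_zero(kept)
```

Filtering before summarizing keeps runs that never fit the task out of the histogram, so a spike near zero reflects circuits that did reach low loss on the training distribution.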

[Image: histograms for opposite gender prompts (left) and same gender prompts (right).]

Often, pruning finds only the bad circuit (see the big spike at zero in the same gender histogram).

This is bad, since the original model had [complex formula omitted from the narration] in the same gender case and so must have been using the good circuit.

Separately, it is also a little worrying that pruning with the same hyperparameters but different random seeds can lead to circuits with totally different OOD behavior.

Conclusion

The above results provide evidence that Gao et al.'s pruning method can find circuits that are small, interpretable, and get good task loss, but nevertheless are unfaithful to what the model is really doing.

These results do not have much to say about whether weight sparse training itself is a promising direction.

They only show that the pruning algorithm is flawed.

My main takeaway is that we should not aim purely to improve the loss-versus-circuit-size Pareto frontier.

Hill climbing on this metric alone is likely to yield mechanistic explanations that look appealing but are actually misleading.

For example, zero ablation improved the frontier, so I switched to it early on.

But in hindsight, mean ablation may have given more faithful circuits at the cost of giving circuits with roughly 100 nodes for a task as simple as pronoun gender matching, which is a lot more than I would have hoped for.
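
To make the distinction between the two ablation schemes concrete, here is a small Python sketch. It is not the post's implementation: the `ablate` and `node_means_over` helpers and the toy activations are hypothetical, and real pruning operates inside a network rather than on plain lists. Zero ablation replaces each pruned node's activation with 0, while mean ablation replaces it with that node's average activation over a dataset.

```python
def node_means_over(dataset):
    """Per-node mean activation across a list of activation vectors."""
    n = len(dataset)
    return [sum(column) / n for column in zip(*dataset)]

def ablate(activations, keep_mask, mode="zero", node_means=None):
    """Return activations with pruned nodes (keep_mask False) replaced.

    mode="zero": pruned nodes are set to 0.0.
    mode="mean": pruned nodes are set to their dataset mean (requires node_means).
    """
    if mode not in ("zero", "mean"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "mean" and node_means is None:
        raise ValueError("mean ablation needs node_means")
    out = []
    for i, (value, keep) in enumerate(zip(activations, keep_mask)):
        if keep:
            out.append(value)
        elif mode == "zero":
            out.append(0.0)
        else:
            out.append(node_means[i])
    return out

# Toy activations for two inputs and three nodes (illustrative only).
dataset = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
means = node_means_over(dataset)
mask = [True, False, True]  # the middle node is pruned

zero_ablated = ablate(dataset[0], mask, mode="zero")
mean_ablated = ablate(dataset[0], mask, mode="mean", node_means=means)
```

Under mean ablation the pruned node still feeds a typical value downstream, so the kept nodes see inputs closer to those in the full model; under zero ablation they may see values the full model never produces.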