Jacob Drori

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Output the pronoun with the opposite gender to name_2.

Let [formula omitted from the narration] be the mean probability assigned to the correct target token, where the probability is computed by softmaxing over only the "him" and "her" tokens.
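
As a rough illustration of this metric, here is a minimal sketch, assuming each example provides the model's logits for "him" and "her" plus the correct target. The function names and the (logit_him, logit_her, target) layout are hypothetical and are not taken from the original post.

```python
import math

def two_token_prob(logit_him, logit_her, target):
    """Probability of `target` under a softmax over only the two pronoun logits."""
    m = max(logit_him, logit_her)  # subtract the max for numerical stability
    e_him = math.exp(logit_him - m)
    e_her = math.exp(logit_her - m)
    p_him = e_him / (e_him + e_her)
    return p_him if target == "him" else 1.0 - p_him

def mean_correct_prob(examples):
    """Mean restricted probability over (logit_him, logit_her, target) triples."""
    return sum(two_token_prob(*ex) for ex in examples) / len(examples)
```

Because all other vocabulary entries are ignored, the result depends only on the difference between the two logits: equal logits give 0.5 regardless of how much mass the full softmax would put elsewhere.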

Here, I focus on the model with [formula omitted from the narration], which completes the task correctly 89% of the time for opposite-gender prompts and 81% of the time for same-gender prompts.

I run pruning 100 times with different random mask initializations and data ordering.

Below I show the resulting distribution of [formula omitted from the narration] for opposite-gender prompts (left) and same-gender prompts (right).

I filter out runs that didn't achieve a CE below 0.15, leaving 77 seeds in total.
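
The filter-then-histogram step can be sketched as follows. This is only an illustration: the record fields "ce" and "p_same", the helper names, and the bin count are assumptions, not the author's actual code.

```python
def filter_converged(runs, ce_threshold=0.15):
    """Keep only pruning runs whose final CE is below the threshold."""
    return [r for r in runs if r["ce"] < ce_threshold]

def histogram(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts
```

With one record per seed, calling histogram on the "p_same" values of the kept runs would surface a concentration of seeds in the lowest bin, if one exists.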

There's an image here.

Often, pruning finds only the bad circuit (see the big spike at zero in the same-gender histogram).

This is bad, since the actual original model had [formula omitted from the narration] in the same-gender case and so must have been using the good circuit.

Separately, it is also a little worrying that pruning with the same hyperparameters but different random seeds can lead to circuits with totally different OOD behavior.

Conclusion

The above results provide evidence that Gao et al.'s pruning method can find circuits that are small, interpretable, and get good task loss, but are nevertheless unfaithful to what the model is really doing.

These results do not have much to say about whether weight-sparse training itself is a promising direction.

They only show that the pruning algorithm is flawed.

My main takeaway is that we should not aim purely to improve the Pareto frontier of loss versus circuit size.

Hill climbing on this metric alone is likely to yield mechanistic explanations that look appealing but are actually misleading.

For example, zero ablation improved the frontier, so I switched to it early on.

But in hindsight, mean ablation may have given more faithful circuits, at the cost of giving circuits with roughly 100 nodes for a task as simple as pronoun gender matching, which is a lot more than I would have hoped for.
