
LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

13 Feb 2026

Transcription

Chapter 1: What is the significance of weight-sparse circuits in AI?

0.031 - 9.197 Jacob Drori

Weight-sparse circuits may be interpretable yet unfaithful. By Jacob Drori. Published on February 9, 2026.


Chapter 2: What tasks are used to evaluate weight-sparse models?

11.084 - 14.228 Jacob Drori

TLDR. Recently, Gao et al.


Chapter 3: How do the results of weight-sparse models compare to dense models?

14.268 - 35.893 Jacob Drori

trained transformers with sparse weights and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks. I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model's true computations. This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.


35.873 - 40.861 Jacob Drori

Heading. Introduction. Recently, Gao et al.


Chapter 4: What are the potential issues with circuit faithfulness in pruned models?

40.881 - 63.797 Jacob Drori

2025 proposed an exciting approach to training models that are interpretable by design. They train transformers where only a small fraction of their weights are non-zero and find that pruning these sparse models on narrow tasks yields interpretable circuits. Their key claim is that these weight sparse models are more interpretable than ordinary dense ones, with smaller task-specific circuits.


64.879 - 75.973 Jacob Drori

Below, I reproduce the primary evidence for these claims. Training weight sparse models does tend to produce smaller circuits at a given task loss than dense models, and the circuits also look interpretable.


Chapter 5: How can pruning lead to misleading interpretations of model behavior?

77.055 - 98.98 Jacob Drori

However, there are reasons to worry that these results don't imply that we're capturing the model's full computation. For example, previous work found that similar masking techniques can achieve good performance on vision tasks even when applied to a model with random weights. Therefore, we might worry that the pruning method can find circuits that were not really present in the original model.


100.022 - 102.226 Jacob Drori

I present evidence that the worry is justified.


Chapter 6: What evidence supports the unfaithfulness of pruned circuits?

102.687 - 127.322 Jacob Drori

Namely, pruned circuits can: achieve low cross-entropy loss on a nonsensical task; solve tasks using uniform attention patterns even when the original model's attention pattern was importantly non-uniform; repurpose nodes to perform different functions than they did in the original model; and behave very differently from the model on inputs that are slightly out of the distribution used for pruning.


128.703 - 148.774 Jacob Drori

Overall, these results suggest that circuits extracted from weight-sparse models, even when interpretable, should be scrutinized for faithfulness. More generally, in interpretability research, we should not purely try to push the Pareto frontier of circuit size and task performance, since doing so may produce misleading explanations of model behavior.


149.816 - 159.55 Jacob Drori

In this post, I briefly review the tasks I designed to test the sparse model methods, present a basic replication of the major results from Gao et al., and then give four


Chapter 7: How do attention patterns differ between pruned and original models?

159.93 - 185.842 Jacob Drori

lines of evidence suggesting that their pruning algorithm produces unfaithful circuits. My code for training and analyzing weight-sparse models is here. It is similar to Gao et al.'s open-source code, but it additionally implements the pruning algorithm, bridges training, multi-GPU support, and an interactive circuit viewer. Training also runs roughly 3x faster in my tests. Heading. Tasks.


Chapter 8: What are the implications of the findings for future AI interpretability research?

186.963 - 212.473 Jacob Drori

I extract weight-sparse circuits via pruning on the following three natural-language tasks. For more details on training and pruning, see the appendix. Subheading. Task 1. Pronoun Matching. Prompts have the form "When [name] [action], [pronoun]". For example: "When Leo ran to the beach, he". "When Mia was at the park, she".


213.274 - 236.605 Jacob Drori

The names are sampled from the 10 most common names (5 male, 5 female) in the pre-training set, SimpleStories. The task loss used for pruning is the CE of predicting the final token, "he" or "she". Subheading. Task 2. Simplified IOI. I use a simplified version of the standard indirect object identification task.
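The prompt construction described above can be sketched as follows. This is a minimal illustration only: the name and action pools here are hypothetical placeholders, not the actual top-10 SimpleStories names used in the post.

```python
import random

# Illustrative pools -- the post samples the 10 most common names
# (5 male, 5 female) from the SimpleStories pre-training set.
MALE_NAMES = ["Leo", "Max", "Tom", "Ben", "Sam"]
FEMALE_NAMES = ["Mia", "Rita", "Leah", "Anna", "Lucy"]
ACTIONS = ["ran to the beach", "was at the park", "went to the shop"]

def make_pronoun_prompt(rng: random.Random) -> tuple[str, str]:
    """Sample one (prompt, target) pair of the form
    'When {name} {action},' -> 'he' or 'she'."""
    if rng.random() < 0.5:
        name, target = rng.choice(MALE_NAMES), "he"
    else:
        name, target = rng.choice(FEMALE_NAMES), "she"
    prompt = f"When {name} {rng.choice(ACTIONS)},"
    return prompt, target

prompt, target = make_pronoun_prompt(random.Random(0))
print(prompt, target)
```

The pruning loss would then be the model's cross-entropy on `target` at the final position of `prompt`.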


237.626 - 257.088 Jacob Drori

Prompts have the form "When [name 1] [action], [name 2] [verb] [pronoun matching name 1]". For example: "When Leo went to the shop, Mia urged him". "When Rita was at the house, Alex hugged her". The task loss used for pruning is the binary CE.


257.929 - 279.752 Jacob Drori

We first compute the model's probability distribution over just "him" and "her", softmaxing only those two logits, and then compute the CE using those probabilities. Subheading. Task 3. Question Marks. The prompts are short sentences from the pre-training set that either end in a period or a question mark, filtered to keep only those where 1.
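The restricted two-token loss described above (softmax over just the two candidate logits, then cross-entropy) can be sketched as follows; this is a minimal illustration, not the post's actual implementation.

```python
import math

def restricted_binary_ce(logit_correct: float, logit_incorrect: float) -> float:
    """Binary CE over just two logits (e.g. 'him' vs 'her'):
    softmax the pair, then take -log p(correct)."""
    # Numerically stable two-way softmax.
    m = max(logit_correct, logit_incorrect)
    p_correct = math.exp(logit_correct - m) / (
        math.exp(logit_correct - m) + math.exp(logit_incorrect - m)
    )
    return -math.log(p_correct)

# Equal logits give p(correct) = 0.5, so the loss is ln 2.
print(round(restricted_binary_ce(1.0, 1.0), 4))  # 0.6931
```

Because only two logits enter the softmax, the loss ignores how much probability mass the model puts on all other vocabulary tokens.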


280.012 - 301.793 Jacob Drori

the dense model predicts the correct final token (period or question mark) with p greater than 0.3, and 2. when restricted to just the period and question mark, the probability that the dense model assigns to the correct token is greater than 0.8. For example: "Why do you want that key?" "That is why I want the key."
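The two filtering conditions can be expressed as a small predicate. A hedged sketch, assuming the dense model's full-vocabulary probabilities for the two candidate tokens are available as a dict; the numbers in the usage lines are made up for illustration.

```python
def keep_example(p_vocab: dict[str, float], correct: str, incorrect: str) -> bool:
    """Keep a sentence only if:
    (1) the dense model's full-vocab probability of the correct final
        token ('.' or '?') exceeds 0.3, and
    (2) restricted to just the two candidate tokens, the correct one
        gets probability above 0.8."""
    p_c, p_i = p_vocab[correct], p_vocab[incorrect]
    full_ok = p_c > 0.3
    restricted_ok = p_c / (p_c + p_i) > 0.8
    return full_ok and restricted_ok

# Hypothetical dense-model probabilities for a question sentence.
print(keep_example({"?": 0.45, ".": 0.05}, correct="?", incorrect="."))  # True
print(keep_example({"?": 0.35, ".": 0.20}, correct="?", incorrect="."))  # False: restricted p is only ~0.64
```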

303.241 - 332.258 Jacob Drori

The task loss used for pruning is the binary CE, softmaxing only the question-mark and period logits. Heading. Results. See the appendix for a slightly tangential investigation into the role of layer norm when extracting sparse circuits. Subheading. Producing sparse, interpretable circuits. Subheading. Zero ablation yields smaller circuits than mean ablation. When pruning, Gao et al.

332.298 - 350.34 Jacob Drori

set masked activations to their mean values over the pre-training set. I found that zero ablation usually leads to much smaller circuits at a given loss (that is, in all subplots below except the one in the third row, rightmost column). Hence I used zero ablation for the rest of the project. There's an image here.
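The difference between the two ablation modes can be shown in a toy sketch, with plain Python lists standing in for activation tensors; this is an illustration of the general technique, not the post's codebase.

```python
def ablate(acts, mask, mode, mean_acts=None):
    """Ablate activations outside the circuit (mask[i] == 0).
    'zero' replaces them with 0.0; 'mean' replaces them with their mean
    value over the pre-training set, as in Gao et al.'s pruning setup."""
    out = []
    for i, (a, keep) in enumerate(zip(acts, mask)):
        if keep:
            out.append(a)          # node is in the circuit: keep as-is
        elif mode == "zero":
            out.append(0.0)        # zero ablation
        elif mode == "mean":
            out.append(mean_acts[i])  # mean ablation
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out

acts = [1.0, -2.0, 3.0]
mask = [1, 0, 1]                                    # nodes 0 and 2 are in the circuit
print(ablate(acts, mask, "zero"))                   # [1.0, 0.0, 3.0]
print(ablate(acts, mask, "mean", [0.5, 0.5, 0.5]))  # [1.0, 0.5, 3.0]
```

Circuit size at a given task loss is then compared under each choice of baseline.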

351.421 - 360.165 Unknown

Description: graphs showing the mean-versus-zero ablation comparison (trainable LN) across different model sizes and tasks.

362.107 - 386.904 Jacob Drori

Subheading. Weight-sparse models usually have smaller circuits. Figure 2 from Gao et al. mostly replicates. In the pronoun and IOI tasks, the sparse models have smaller circuits than the dense model at a given loss. On the question task, only two of the sparse models have smaller circuits than the dense one, and even then, the reduction in size is smaller than it was for the other two tasks.
