Jacob Drori
Weight-sparse circuits may be interpretable yet unfaithful.
By Jacob Drori.
Published on February 9, 2026.
TL;DR
Recently, Gao et al. trained transformers with sparse weights and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks.
I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model's true computations.
This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.
Introduction

Recently, Gao et al. (2025) proposed an exciting approach to training models that are interpretable by design.
They train transformers where only a small fraction of their weights are non-zero and find that pruning these sparse models on narrow tasks yields interpretable circuits.
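To make the idea concrete, here is a minimal sketch of one common way to enforce weight sparsity: top-k magnitude masking, which zeroes all but the largest-magnitude entries of a weight matrix. This is an illustration of the general technique, not a reproduction of Gao et al.'s training procedure; the function name and the `frac_nonzero` parameter are my own.

```python
import numpy as np

def sparsify(weights, frac_nonzero=0.05):
    """Keep only the largest-magnitude weights; zero out the rest."""
    flat = np.abs(weights).ravel()
    k = max(1, int(frac_nonzero * flat.size))
    # Threshold at the k-th largest magnitude.
    threshold = np.partition(flat, -k)[-k]
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_sparse = sparsify(W, frac_nonzero=0.05)
print(np.mean(W_sparse != 0))  # fraction of non-zero weights, close to 0.05
```

In actual sparse training the mask would be applied (or re-derived) during optimization so the surviving weights can adapt, rather than pruning once after the fact.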
Their key claim is that these weight sparse models are more interpretable than ordinary dense ones, with smaller task-specific circuits.
Below, I reproduce the primary evidence for these claims.
Training weight sparse models does tend to produce smaller circuits at a given task loss than dense models, and the circuits also look interpretable.
However, there are reasons to worry that these results don't imply that we're capturing the model's full computation.
For example, previous work found that similar masking techniques can achieve good performance on vision tasks even when applied to a model with random weights.
Therefore, we might worry that the pruning method can find circuits that were not really present in the original model.
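A toy example of why masking alone is so expressive: given enough random weights, one can select a subset whose combination closely matches an arbitrary target, even though no training ever touched the weights themselves. The greedy subset selection below is my own illustrative construction, not the masking method used in the cited vision work.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=500)   # "random weights", never trained
target = 3.7

# Greedily include a weight whenever doing so moves the masked sum
# closer to the target.
mask = np.zeros_like(w, dtype=bool)
total = 0.0
for i in np.argsort(-np.abs(w)):  # consider large-magnitude weights first
    if abs(total + w[i] - target) < abs(total - target):
        mask[i] = True
        total += w[i]

print(total)  # close to 3.7, purely by choosing which weights to keep
```

The analogy is loose, but the mechanism is the same: a mask over many random parameters is itself a powerful optimization variable, so a pruning procedure can "discover" structure that the unmasked model never computed.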
I present evidence that the worry is justified.
Namely, pruned circuits can:
Achieve low cross-entropy loss on a nonsensical task.