
Jacob Drori


Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Hence the head attends most strongly to the first part of the prompt, and so the attention_out node only gets a large contribution from the value vector node when it appears near the start of the sentence.


Specifically, the attention_out node gets a large negative contribution when name_1 is male.


The other value vector node, not shown here, gives a positive contribution when name_1 is female.


This explains the activation patterns we saw above for the attention_out node.
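The mechanism described in the excerpts above can be sketched as a toy single-head attention computation. All numbers, and the two value-node readouts, are illustrative assumptions, not values from the post:

```python
import numpy as np

# Toy sketch: one attention head at the final position, with attention
# weights over two earlier positions. The head attends mostly to the
# start of the prompt, as described above.
attn_weights = np.array([0.9, 0.1])

# Hypothetical readouts of a value-vector node at those two positions:
# negative when name_1 is male, positive when name_1 is female.
values_male = np.array([-2.0, 0.0])
values_female = np.array([+2.0, 0.0])

# attention_out is the attention-weighted sum of value readouts, so the
# sign of the early-position value dominates.
attention_out_male = attn_weights @ values_male      # large negative contribution
attention_out_female = attn_weights @ values_female  # large positive contribution
print(attention_out_male, attention_out_female)
```

Because the weight on the first position is large, the contribution's sign tracks the gender-dependent value readout there, matching the activation pattern described.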






For example:


When Rita went to the woods, when?


When Leah went to the woods, dis.


As in the standard pronouns task, the task loss is the standard cross-entropy (CE) loss; that is, all logits are softmaxed.


I am not using binary CE.
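The distinction drawn here can be made concrete: standard CE takes a softmax over the full logit vector and penalizes the negative log-probability of the target token, rather than a binary CE over just two candidate logits. A minimal sketch with an illustrative four-token vocabulary (numbers are made up, not from the post):

```python
import numpy as np

def ce_loss(logits, target_idx):
    # Standard cross-entropy: softmax over *all* logits,
    # then -log p(target). Shift by max for numerical stability.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_idx]

logits = np.array([4.0, 1.0, 0.5, -1.0])  # toy vocabulary of 4 tokens
loss = ce_loss(logits, target_idx=0)      # small, since the target logit dominates
print(loss)
```

Note that every logit enters the normalizer, so the loss depends on the whole vocabulary, unlike a binary CE restricted to two tokens.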


This is a nonsense task, but the pruned model gets task loss below 0.05, corresponding to greater than 95% probability on the correct token, with only roughly 30 nodes.
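The conversion between the two figures is just arithmetic: a per-token CE loss L implies probability exp(-L) on the correct token, so a loss of 0.05 means roughly 95% probability. A quick check:

```python
import math

# CE loss L on the correct token implies p(correct) = exp(-L).
loss = 0.05
p_correct = math.exp(-loss)
print(round(p_correct, 3))  # ≈ 0.951, i.e. >95% probability on the correct token
```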




Around 10 nodes are needed to achieve a similar loss on the ordinary pronouns task.


So the nonsense task does require a larger circuit than the real task, which is somewhat reassuring.


That said, it seems worrying that any circuit at all is able to get such low loss on the nonsense task, and 30 nodes is really not many.


You can view the nonsense circuit here.




Important attention patterns can be absent in the pruned model.


This pronoun circuit has attention nodes only in layer 1, head 7.