Jacob Drori

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

In the original model, this head attends strongly from the last token to the name token, Rita, as one would expect.

But in the pruned model, its attention pattern is uniform, since there are no query or key vector nodes.
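A minimal sketch of why the pattern comes out uniform, assuming that removing every query and key node leaves all attention scores equal (modeled here as all-zero query and key vectors). The function name, shapes, and sequence length are illustrative and not taken from the post.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Softmax attention weights under a causal mask; q and k have shape (seq, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Each position may only attend to itself and earlier positions.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len, d_head = 5, 4
q = np.zeros((seq_len, d_head))  # assumption: no surviving query nodes
k = np.zeros((seq_len, d_head))  # assumption: no surviving key nodes
weights = causal_attention_weights(q, k)
last_row = weights[-1]  # attention from the last token to every position
```

With equal scores, the softmax spreads weight evenly over every position the last token can see, so the head has no way to single out the name through attention alone.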

There's an image here.

How does the pruned circuit get away with not bothering to compute an attention pattern?

It does so by having all its value vector nodes be ones that fire strongly on names and very weakly everywhere else.

So even though the head attends to all tokens, it only moves information from the name token.
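The mechanism can be illustrated with a toy calculation. This is a sketch under stated assumptions: the prompt tokens (other than the name Rita) and the activation values are hypothetical, chosen only so that the value node is large on the name and near zero elsewhere.

```python
import numpy as np

# Hypothetical toy prompt; only the name "Rita" comes from the episode.
tokens = ["When", "Rita", "went", "to", "the", "shop", ","]

# Hypothetical activations of one value node: strong on the name, weak elsewhere.
value_acts = np.array([0.01, 2.0, 0.02, 0.0, 0.01, 0.02, 0.0])

# Uniform attention from the last token: equal weight on every position.
uniform = np.full(len(tokens), 1.0 / len(tokens))

# The head output is the attention-weighted sum of the value activations.
contributions = uniform * value_acts
head_output = contributions.sum()

# Fraction of the head output that originates at the name token.
name_share = contributions[tokens.index("Rita")] / head_output
```

In this toy setting almost all of the head's output comes from the name position, even though the attention weights carry no information about where the name is.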

Such a mechanism was not available to the original model.

The circuit we found misses a crucial part of what the original model was doing.

Subheading: Nodes can play different roles in the pruned model.

Example 1.

Below are the activations of layer 0, node 1651 from this IOI circuit.

The left figure shows its activations in the pruned model, where it activates negatively (red) on female names.

The right figure shows its activations in the original model, where it activates positively (blue) on male names.

In both cases, its activation is very close to 0 for all non-name tokens.

This is strange.

The node acquires a different meaning after pruning.
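One way to check for this kind of role change is to average a node's activation per token category in each model and compare signs. The sketch below uses made-up activations and category labels, not the post's data, and the helper name is hypothetical. It also assumes the pruned node is near zero on male names and the original node near zero on female names, which the episode does not state.

```python
import numpy as np

def mean_activation_by_category(activations, categories):
    """Return {category: mean activation} for one node over a token sequence."""
    activations = np.asarray(activations, dtype=float)
    categories = np.asarray(categories)
    return {
        str(c): float(activations[categories == c].mean())
        for c in np.unique(categories)
    }

# Hypothetical per-token labels and activations for a single node.
categories = ["other", "female", "other", "male", "other", "female", "male"]
pruned_acts = [0.0, -1.2, 0.01, 0.02, 0.0, -0.9, 0.0]   # negative on female names
original_acts = [0.0, 0.03, 0.0, 1.1, 0.01, 0.0, 0.8]   # positive on male names

pruned = mean_activation_by_category(pruned_acts, categories)
original = mean_activation_by_category(original_acts, categories)
```

If the category with the largest-magnitude mean differs between the two models, as it does in this toy data, an interpretation read off the pruned circuit may not describe what the node does in the original model.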

There's an image here.

Below are activations of attention_out, layer 1, node 244 from this IOI circuit.

In the pruned model, the node activates positively on contexts where name_1 (the first-appearing name) is female, and negatively on ones where it is male.
