Jacob Drori

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

You can view circuits for each task here.

401.983 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Hovering over, clicking on a node shows its activations either in the original or the pruned model.

405.098 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Below is a brief summary of how I think the IOI circuit works.

411.045 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I walk through circuits for the other two tasks in the appendix.

415.289 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Each circuit I walk through here was extracted from the complex formula omitted from the narration.

419.334 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Model.

424.7 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I did not inspect circuits extracted from the other models as carefully.

425.781 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

All the patoken activations shown in this section are taken from the pruned model, not the original model.

430.057 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

IOI task, view circuit.

436.346 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Below is an important node from layer 1 attention underscore out.

439.251 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

It activates positively on prompts where name underscore 1 is female and negatively on prompts where it is male.

443.557 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

It then suppresses the TIM logic.

450.347 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

To see how this node's activation is computed, we can inspect the value vector nodes it reads from and the corresponding key and query nodes.

463.124 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The value vector node shown below activates negatively on male names.

471.676 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There's an image here.

476.563 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There are two query key pairs.

488.199 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The first query vector always has negative activations, not shown.

490.782 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The corresponding key node's activation is negative, with magnitude roughly decreasing as a function of token position.

495.488 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There's an image here.

502.618 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The other query key pair does the same thing but with positive activations.

511.965 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment