Jacob Drori
You can view circuits for each task here.
Hovering over or clicking on a node shows its activations in either the original or the pruned model.
Below is a brief summary of how I think the IOI circuit works.
I walk through circuits for the other two tasks in the appendix.
Each circuit I walk through here was extracted from the (formula omitted from the narration) model.
I did not inspect circuits extracted from the other models as carefully.
All the per-token activations shown in this section are taken from the pruned model, not the original model.
IOI task (view circuit).
Below is an important node from layer 1 attn_out.
It activates positively on prompts where name_1 is female and negatively on prompts where it is male.
It then suppresses the Tim logit.
To see how this node's activation is computed, we can inspect the value vector nodes it reads from and the corresponding key and query nodes.
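To make the decomposition concrete, here is a minimal sketch (the function name, shapes, and numbers are my own assumptions, not the author's circuit code) of how an attention node's output activation breaks down into the value vectors it reads from, weighted by query-key scores:

```python
import numpy as np

def attn_node_activation(q, K, V, t):
    """Output at query position t: softmax(q @ K.T / sqrt(d)) @ V,
    restricted to positions <= t (causal mask). Inspecting the value
    nodes (rows of V) and the query-key pair (q, K) therefore accounts
    for the whole computation of the node's activation."""
    d = q.shape[-1]
    scores = K[: t + 1] @ q / np.sqrt(d)      # query-key dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over positions
    return weights @ V[: t + 1]               # weighted sum of value vectors
```

With zero keys the softmax is uniform, so the output is just the mean of the value vectors up to position t, which makes the value-side contribution easy to see in isolation.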
The value vector node shown below activates negatively on male names.
[image]
There are two query-key pairs.
The first query node always has negative activations (not shown).
The corresponding key node's activation is negative, with magnitude roughly decreasing as a function of token position.
[image]
The other query-key pair does the same thing, but with positive activations.
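A toy illustration of why the two pairs "do the same thing" (the numbers are invented, not taken from the circuit): each pair contributes the product of query and key activations to the attention score, so a negative query against negative, position-decaying keys yields exactly the same decaying positive score profile as a positive query against positive, position-decaying keys.

```python
# Hypothetical numbers: query-key score contribution is q * k per position.
positions = range(5)
q_neg, q_pos = -1.0, 1.0
k_neg = [-1.0 / (p + 1) for p in positions]  # negative, magnitude decaying
k_pos = [1.0 / (p + 1) for p in positions]   # positive, magnitude decaying

scores_from_neg_pair = [q_neg * k for k in k_neg]
scores_from_pos_pair = [q_pos * k for k in k_pos]
assert scores_from_neg_pair == scores_from_pos_pair  # identical decaying profiles
```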