Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Jacob Drori

๐Ÿ‘ค Speaker
274 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

When M.I.A.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

was at the park, she.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The names are sampled from the 10 most common names, 5 male, 5 female, from the pre-training set, Simple Stories.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The task loss used for pruning is the CE in predicting the final token, he wore she.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Subheading.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Task 2.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Simplified IOI.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

I use a simplified version of the standard indirect object identification task.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Prompts have the form when name underscore one, action, name underscore two, verb, pronoun matching name underscore one.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

For example.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

When Leah went to the shop, MIA urged him.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

When Rita was at the house, Alex hugged her.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The task loss used for pruning is the binary CE.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

We first compute the model's probability distribution just over him and her, soft-maxing just those two logits, and then compute the CE using those probabilities.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Subheading.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Task 3.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Question marks.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The prompts are short sentences from the pre-training set that either end in a period or a question mark, filtered to keep only those where 1.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

The dense model predicts the correct final token, period or question mark, with p greater than 0.3, and 2.

LessWrong (Curated & Popular)
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

When restricted to just the period and question mark, the probability that the dense model assigns to the correct token is greater than 0.8.