Jacob Drori

I found that zero ablation usually leads to much smaller circuits at a given loss, that is in all subplots below except the third row, rightmost column.

336.883 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Hence I used zero ablation for the rest of the project.

345.854 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There's an image here.

349.539 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Subheading.

362.107 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Weight sparse models usually have smaller circuits.

363.529 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Figure 2 from GAO et al.

367.153 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

mostly replicates.

368.654 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

In the pronoun and IOI tasks, the sparse models have smaller circuits than the dense model at a given loss.

370.797 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

On the question task, only two of the sparse models have smaller circuits than the dense one, and even then, the reduction in size is smaller than it was for the other two tasks.

377.693 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There's an image here.

387.966 View full episode →

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Weight sparse circuits look interpretable.

399.099 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment