Jacob Drori

274 total appearances

Podcast Appearances

LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

There's an image here.

I think it is morally correct to freeze LN during pruning so that the model cannot cheat in the way described above, but it seems doing so does not fully fix the faithfulness issues (see the IOI [complex formula omitted from the narration] results directly above).

A final caveat to the results in this appendix.

For each model and task, I performed a CARBS sweep to find the best hyperparameters for pruning, and then used these best hyperparameters for each of the 100 randomly seeded pruning runs.

It may be the case that, for example, for the [complex formula omitted from the narration], we happened to find unlucky hyperparameters that lead to poor generalization to same-gender prompts, whereas we got lucky with the hyperparameters we found for the [complex formula omitted from the narration] model.

In other words, the 100 seeds are perhaps not as decorrelated as we'd like.

This article was narrated by TYPE III AUDIO for LessWrong.

It was published on February 9, 2026.

The original text contained six footnotes which were omitted from the narration.

Images are included in the podcast episode description.
