AlgZoo: Uninterpreted models with fewer than 1,500 parameters.
By Jacob Hilton.
Published on January 26, 2026.
Audio note: This article contains 78 uses of LaTeX notation, so the narration may be difficult to follow.
There's a link to the original text in the episode description.
This post covers work done by several researchers at, and collaborators of, ARC, including Zihau Chen, George Robinson, Dávid Matolcsi, Jacob Stavrianos, Jiwei Li and Michael Sklo.
Thanks to Aryan Bhatt, Gabriel Wu, Jiwei Li, Lee Sharkey, Victor Lecomte and Zihau Chen for comments.
In the wake of recent debate about pragmatic versus ambitious visions for mechanistic interpretability, ARC is sharing some models we've been studying that, in spite of their tiny size, serve as challenging test cases for any ambitious interpretability vision.
The models are RNNs and transformers trained to perform algorithmic tasks, and range in size from 8 to 1,408 parameters.
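To give a concrete sense of that scale, here is an illustrative sketch only; the layer sizes below are made up and are not taken from the AlgZoo repo. A single-layer RNN with a two-token vocabulary and a hidden size of 4 already has a few dozen parameters:

```python
# Illustrative sketch of a model at roughly AlgZoo scale; sizes are hypothetical, not from the repo.
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, vocab_size: int = 2, hidden_size: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)          # 2 x 4 = 8 params
        self.rnn = nn.RNN(hidden_size, hidden_size,
                          batch_first=True, bias=False)             # 4x4 + 4x4 = 32 params
        self.readout = nn.Linear(hidden_size, 1, bias=False)        # 4 params

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)           # (batch, seq, hidden)
        h, _ = self.rnn(x)               # (batch, seq, hidden)
        return self.readout(h[:, -1])    # one logit per sequence, read off the final state

model = TinyRNN()
print(sum(p.numel() for p in model.parameters()))  # 44 parameters at these sizes
```

Even at these near-minimal widths the parameter count falls well within the 8-to-1,408 range above.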
The largest model that we believe we more or less fully understand has 32 parameters.
The next largest model that we have put substantial effort into, but have failed to fully understand, has 432 parameters.
The models are available at the AlgZoo GitHub repo.
We think that the ambitious side of the mechanistic interpretability community has historically underinvested in fully understanding slightly complex models compared to partially understanding incredibly complex models.
There has been some prior work aimed at full understanding, for instance on models trained to perform paren balancing, modular addition and more general group operations, but we still don't think the field is close to being able to fully understand our models, at least not in the sense we discuss in this post.
If we are going to one day fully understand multibillion parameter LLMs, we probably first need to reach the point where fully understanding models with a few hundred parameters is pretty easy.
We hope that AlgZoo will spur research to either help us reach that point or help us reckon with the magnitude of the challenge we face.
One likely reason for this underinvestment is lingering philosophical confusion over the meaning of "explanation" and "full understanding".
Our current perspective at ARC is that, given a model that has been optimized for a particular loss, an explanation of the model amounts to a mechanistic estimate of the model's loss.
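To make that concrete, here is a hedged sketch of the idea in assumed notation, not a definition taken from this post: suppose the model f_theta was trained to minimize an expected loss over an input distribution D. Then an explanation is an estimator of that loss computed mechanistically from the weights themselves, rather than empirically by sampling inputs.

```latex
% Sketch only; the notation here is assumed, not taken from the post.
% The trained model $f_\theta$ has expected loss
\[
  L(\theta) \;=\; \mathbb{E}_{x \sim D}\!\left[ \ell\!\left( f_\theta(x),\, y(x) \right) \right],
\]
% and an explanation is a mechanistic estimator $\hat{L}(\theta)$, computed from the
% weights and structure of the model rather than by averaging over samples $x \sim D$,
% whose quality is judged by how closely $\hat{L}(\theta)$ tracks $L(\theta)$.
```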