Jacob Hilton

Speaker
204 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton

We evaluate mechanistic estimates in one of two ways.

We use surprise accounting to determine whether we have achieved a full understanding.

But for practical purposes, we simply look at mean squared error as a function of compute, which allows us to compare the estimate with sampling.

In the rest of this post, we will:

Review our perspective on mechanistic estimates as explanations, including our two ways of evaluating mechanistic estimates.

Walk through three AlgZoo RNNs that we've studied, the smallest of which we fully understand, and the largest of which we don't.

Conclude with some thoughts on how ARC's approach relates to ambitious mechanistic interpretability.

Mechanistic Estimates as Explanations

Models from AlgZoo are trained to perform a simple algorithmic task, such as calculating the position of the second-largest number in a sequence.
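To make the example task concrete, here is a minimal sketch of the target function a model would be trained to reproduce. The function names, the use of Gaussian inputs, and the assumption of distinct values are illustrative choices, not details confirmed by the transcript.

```python
import random


def second_argmax(xs):
    """Index of the second-largest value in xs.

    Assumes len(xs) >= 2 and distinct values (hypothetical simplification).
    """
    # Sort the indices by their values, largest first, and take the runner-up.
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]


def sample_input(n, rng):
    """Draw a length-n input sequence (Gaussian entries are an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(0)
xs = sample_input(10, rng)
label = second_argmax(xs)  # the position a model is meant to output
```

A model's accuracy on this task is then the probability, over random inputs, that its output equals `second_argmax(xs)`.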

To explain why such a model has good performance, we can produce a mechanistic estimate of its accuracy.

By mechanistic, we mean that the estimate reasons deductively based on the structure of the model, in contrast to a sampling-based estimate, which makes inductive inferences about the overall performance from individual examples.
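The contrast between the two kinds of estimate can be sketched with a toy stand-in. The `toy_model` below is hypothetical and is not one of the trained RNNs: it simply ignores the last input, which makes its accuracy easy to derive from its structure. The sampling estimate averages over random examples, while the deductive estimate reasons that the toy model is correct exactly when the last input is not among the two largest, which happens with probability (n - 2) / n for independent continuous inputs.

```python
import random


def second_argmax(xs):
    """Index of the second-largest value (assumes distinct values)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]


def toy_model(xs):
    """Hypothetical stand-in for a trained model: ignores the final input."""
    return second_argmax(xs[:-1])


def sampling_estimate(n, num_samples, rng):
    """Inductive estimate: fraction of random inputs the toy model gets right."""
    hits = 0
    for _ in range(num_samples):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Gaussian inputs assumed
        hits += toy_model(xs) == second_argmax(xs)
    return hits / num_samples


def deductive_estimate(n):
    """Deductive estimate from the toy model's structure (requires n >= 3).

    The toy model is right exactly when the last input is not one of the
    two largest, which has probability (n - 2) / n for i.i.d. continuous inputs.
    """
    return (n - 2) / n


sampled = sampling_estimate(10, 20_000, random.Random(0))
deduced = deductive_estimate(10)  # 0.8 for n = 10
```

For a real trained network the deductive argument is far harder to write down, which is why the quality of such estimates needs to be evaluated.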

Further explanation of this concept can be found here.

Not all mechanistic estimates are high quality.

For example, if the model had to choose between 10 different numbers, then before doing any analysis at all, we might estimate the accuracy of the model to be 10%.

This would be a mechanistic estimate, but a very crude one.

So we need some way to evaluate the quality of a mechanistic estimate.

We generally do this using one of two methods.

1. Mean squared error versus compute.

The more conceptually straightforward way to evaluate a mechanistic estimate is to simply ask how close it gets to the model's actual accuracy.

The more compute-intensive the mechanistic estimate, the closer it should get to the actual accuracy.
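As a reference point for this comparison, the sampling baseline has a simple error curve. If the true accuracy is p and the estimate averages k independent samples, its mean squared error is p(1 - p) / k, so doubling the compute halves the error. The sketch below checks this numerically; the function names and the choice of counting samples as the unit of compute are assumptions for illustration, not the post's exact setup.

```python
import random


def theoretical_sampling_mse(p, k):
    """MSE of the mean of k Bernoulli(p) samples as an estimate of p."""
    return p * (1.0 - p) / k


def empirical_sampling_mse(p, k, trials, rng):
    """Average squared error of the k-sample estimate over repeated trials."""
    total = 0.0
    for _ in range(trials):
        # Each sample records whether the model was correct on one random input.
        estimate = sum(rng.random() < p for _ in range(k)) / k
        total += (estimate - p) ** 2
    return total / trials


rng = random.Random(0)
predicted = theoretical_sampling_mse(0.8, 100)
measured = empirical_sampling_mse(0.8, 100, 2000, rng)
```

Plotting a mechanistic estimate's squared error against its own compute cost on the same axes as this curve would be one way to compare the two at matched compute.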