Jacob Hilton
We evaluate mechanistic estimates in one of two ways.
We use surprise accounting to determine whether we have achieved a full understanding.
But for practical purposes, we simply look at mean squared error as a function of compute, which lets us compare mechanistic estimates directly against sampling-based ones.
In the rest of this post, we will review our perspective on mechanistic estimates as explanations, including our two ways of evaluating them; walk through three AlgZoo RNNs that we've studied, the smallest of which we fully understand and the largest of which we don't; and conclude with some thoughts on how ARC's approach relates to ambitious mechanistic interpretability.
Mechanistic Estimates as Explanations
Models from AlgZoo are trained to perform a simple algorithmic task, such as calculating the position of the second largest number in a sequence.
To explain why such a model has good performance, we can produce a mechanistic estimate of its accuracy.
By mechanistic, we mean that the estimate reasons deductively based on the structure of the model, in contrast to a sampling-based estimate, which makes inductive inferences about the overall performance from individual examples.
Further explanation of this concept can be found here.
Not all mechanistic estimates are high quality.
For example, if the model has to choose between 10 different answers, then before doing any analysis at all we might estimate its accuracy to be 10%.
This would be a mechanistic estimate, but a very crude one.
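To make the contrast concrete, here is a minimal sketch, with a toy stand-in for a trained model on the second-largest-position task. All names and numbers here are illustrative assumptions, not ARC's actual models or code: the toy model is simply correct 80% of the time by construction, the sampling-based estimate averages its correctness over random inputs, and the crude mechanistic estimate is the 10% baseline described above.

```python
import random

SEQ_LEN = 10  # illustrative: sequences of 10 numbers, so 10 possible answers


def second_largest_position(xs):
    """Ground truth: index of the second-largest element."""
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]


def toy_model(xs, rng, p_correct=0.8):
    # Stand-in for a trained RNN (purely hypothetical): answers
    # correctly with probability 0.8, otherwise guesses uniformly.
    if rng.random() < p_correct:
        return second_largest_position(xs)
    return rng.randrange(SEQ_LEN)


def sampling_estimate(n_samples, seed=0):
    # Inductive: draw random inputs, run the model, average correctness.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xs = [rng.random() for _ in range(SEQ_LEN)]
        hits += toy_model(xs, rng) == second_largest_position(xs)
    return hits / n_samples


def crude_mechanistic_estimate():
    # Deductive but crude: the model chooses one of SEQ_LEN answers,
    # so with no further analysis we estimate 1/SEQ_LEN accuracy.
    return 1 / SEQ_LEN
```

The sampling estimate converges to the toy model's true accuracy (0.82 here) as samples grow, while the crude mechanistic estimate stays at 0.1; a better mechanistic estimate would close that gap by analyzing more of the model's structure.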
So we need some way to evaluate the quality of a mechanistic estimate.
We generally do this using one of two methods.
1. Mean squared error versus compute.
The more conceptually straightforward way to evaluate a mechanistic estimate is to simply ask how close it gets to the model's actual accuracy.
The more compute-intensive the mechanistic estimate, the closer it should get to the actual accuracy.
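For the sampling baseline in this comparison, the curve is easy to trace out: the squared error of a sampling-based estimate shrinks like variance over the number of samples. Here is a minimal sketch under an assumed hypothetical true accuracy of 0.82 (standing in for an actual model), where one Bernoulli draw represents one unit of evaluation compute.

```python
import random

TRUE_ACC = 0.82  # hypothetical true accuracy, assumed for illustration


def sampling_estimate(n, rng):
    # Each Bernoulli draw stands in for evaluating the model on one
    # random input (costing one unit of compute).
    return sum(rng.random() < TRUE_ACC for _ in range(n)) / n


def mse_at_compute(n, trials=200, seed=0):
    # Empirical mean squared error of the sampling estimate at budget n,
    # averaged over independent trials.
    rng = random.Random(seed)
    errs = [(sampling_estimate(n, rng) - TRUE_ACC) ** 2 for _ in range(trials)]
    return sum(errs) / trials
```

For sampling, this curve follows roughly TRUE_ACC * (1 - TRUE_ACC) / n; a good mechanistic estimate should beat this curve at some compute budget.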