Jacob Hilton
We evaluate mechanistic estimates in one of two ways.
We use surprise accounting to determine whether we have achieved a full understanding.
But for practical purposes, we simply look at mean squared error as a function of compute, which lets us compare mechanistic estimates directly against sampling-based ones.
In the rest of this post, we will review our perspective on mechanistic estimates as explanations, including our two ways of evaluating them; walk through three AlgZoo RNNs that we've studied, the smallest of which we fully understand and the largest of which we don't; and conclude with some thoughts on how ARC's approach relates to ambitious mechanistic interpretability.
Mechanistic Estimates as Explanations
Models from AlgZoo are trained to perform a simple algorithmic task, such as calculating the position of the second largest number in a sequence.
To explain why such a model has good performance, we can produce a mechanistic estimate of its accuracy.
By mechanistic, we mean that the estimate reasons deductively based on the structure of the model, in contrast to a sampling-based estimate, which makes inductive inferences about the overall performance from individual examples.
Further explanation of this concept can be found here.
Not all mechanistic estimates are high quality.
For example, if the model has to choose between 10 different answers, then before doing any analysis at all we might estimate its accuracy to be 10%.
This would be a mechanistic estimate, but a very crude one.
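To make the contrast concrete, here is a minimal sketch, with a toy stand-in for a trained model on the second-largest-position task. All names and numbers here are illustrative assumptions, not ARC's actual models or code: the toy model is simply correct 80% of the time by construction, the sampling-based estimate averages its correctness over random inputs, and the crude mechanistic estimate is the 10% baseline described above.

```python
import random

SEQ_LEN = 10  # illustrative: sequences of 10 numbers, so 10 possible answers


def second_largest_position(xs):
    """Ground truth: index of the second-largest element."""
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]


def toy_model(xs, rng, p_correct=0.8):
    # Stand-in for a trained RNN (purely hypothetical): answers
    # correctly with probability 0.8, otherwise guesses uniformly.
    if rng.random() < p_correct:
        return second_largest_position(xs)
    return rng.randrange(SEQ_LEN)


def sampling_estimate(n_samples, seed=0):
    # Inductive: draw random inputs, run the model, average correctness.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xs = [rng.random() for _ in range(SEQ_LEN)]
        hits += toy_model(xs, rng) == second_largest_position(xs)
    return hits / n_samples


def crude_mechanistic_estimate():
    # Deductive but crude: the model chooses one of SEQ_LEN answers,
    # so with no further analysis we estimate 1/SEQ_LEN accuracy.
    return 1 / SEQ_LEN
```

The sampling estimate converges to the toy model's true accuracy (0.82 here) as samples grow, while the crude mechanistic estimate stays at 0.1; a better mechanistic estimate would close that gap by analyzing more of the model's structure.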
So we need some way to evaluate the quality of a mechanistic estimate.
We generally do this using one of two methods.
1. Mean squared error versus compute.
The more conceptually straightforward way to evaluate a mechanistic estimate is to simply ask how close it gets to the model's actual accuracy.
The more compute-intensive the mechanistic estimate, the closer it should get to the actual accuracy.
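For the sampling baseline in this comparison, the curve is easy to trace out: the squared error of a sampling-based estimate shrinks like variance over the number of samples. Here is a minimal sketch under an assumed hypothetical true accuracy of 0.82 (standing in for an actual model), where one Bernoulli draw represents one unit of evaluation compute.

```python
import random

TRUE_ACC = 0.82  # hypothetical true accuracy, assumed for illustration


def sampling_estimate(n, rng):
    # Each Bernoulli draw stands in for evaluating the model on one
    # random input (costing one unit of compute).
    return sum(rng.random() < TRUE_ACC for _ in range(n)) / n


def mse_at_compute(n, trials=200, seed=0):
    # Empirical mean squared error of the sampling estimate at budget n,
    # averaged over independent trials.
    rng = random.Random(seed)
    errs = [(sampling_estimate(n, rng) - TRUE_ACC) ** 2 for _ in range(trials)]
    return sum(errs) / trials
```

For sampling, this curve follows roughly TRUE_ACC * (1 - TRUE_ACC) / n; a good mechanistic estimate should beat this curve at some compute budget.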