Jacob Hilton

The models in AlgZoo are small, but for all but the tiniest of them, it is a considerable challenge to mechanistically estimate their accuracy competitively with sampling, let alone fully understand them in the sense of surprise accounting.

1195.945

LessWrong (Curated & Popular)

"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton

At the same time, AlgZoo models are trained on tasks that can easily be performed by LLMs, so fully understanding them is practically a prerequisite for ambitious LLM interpretability.

1208.523

Overall, we would be keen to see other ambitious-oriented researchers explore our models, and more concretely, we would be excited to see better mechanistic estimates for our models in the sense of mean squared error versus compute.

1219.098

One specific challenge we pose is the following.

1232.402

Challenge.

1235.635

Design a method for mechanistically estimating the accuracy of the 432-parameter model.

1237.036

Complex formula omitted from the narration.

1242.722

That matches the performance of random sampling in terms of mean squared error versus compute.

1245.205

A cheap way to measure mean squared error is to add enough noise to the model's weights to significantly alter the model's accuracy, and then check the squared error of the method, averaged over the choice of noisy model.
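
The noisy-model check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the AlgZoo setup: the toy argmax task, the 4x4 weight matrix, and the names `model_correct`, `sampled_accuracy`, `noisy_model_mse`, and `sampling_baseline` are all hypothetical, and the "true" accuracy of each noisy model is itself approximated with a large sample.

```python
import numpy as np

def model_correct(weights, x):
    """Hypothetical toy task: predict the argmax position of each input row."""
    logits = x @ weights  # (batch, n) @ (n, n) -> (batch, n)
    return logits.argmax(axis=1) == x.argmax(axis=1)

def sampled_accuracy(weights, n_samples, rng):
    """Estimate accuracy by random sampling of Gaussian inputs."""
    x = rng.standard_normal((n_samples, weights.shape[0]))
    return float(model_correct(weights, x).mean())

def noisy_model_mse(estimator, base_weights, noise_scale, n_models, rng,
                    n_reference=50_000):
    """Average squared error of `estimator` over randomly perturbed models.

    The reference accuracy is a large-sample estimate (an assumption of this
    sketch), so it contributes a small amount of noise of its own.
    """
    errors = []
    for _ in range(n_models):
        noisy = base_weights + noise_scale * rng.standard_normal(base_weights.shape)
        reference = sampled_accuracy(noisy, n_reference, rng)
        errors.append((estimator(noisy) - reference) ** 2)
    return float(np.mean(errors))

rng = np.random.default_rng(0)
base = np.eye(4)  # identity weights solve the toy task exactly

def sampling_baseline(weights):
    """Baseline to beat: plain sampling with a small budget of 100 inputs."""
    return sampled_accuracy(weights, 100, rng)

mse = noisy_model_mse(sampling_baseline, base, noise_scale=0.5, n_models=20, rng=rng)
print(f"sampling baseline MSE over noisy models: {mse:.5f}")
```

A mechanistic method would be passed in place of `sampling_baseline` and compared at a matched compute budget. For reference, a sampling estimate from n independent inputs has variance p(1-p)/n when the true accuracy is p, which is the curve such a method would need to match.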

1250.81

How does ARC's broader approach relate to this?

1261.841

The analysis we have presented here is relatively traditional mechanistic interpretability, but we think of this analysis mainly as a warm-up.

1264.93

Ultimately, we seek a scalable, algorithmic approach to producing mechanistic estimates, which we have been pursuing in our recent work.

1273.327

Furthermore, we are ambitious in the sense that we would like to fully exploit the structure present in models to mechanistically estimate any quantity of interest.

1281.265

Thus our approach is best described as ambitious and mechanistic, but perhaps not as interpretability.

1289.738