
LessWrong (Curated & Popular)
"Deep learning as program synthesis" by Zach Furman

The learning process is not looking for just any program that fits the data; it is looking for the simplest one. Giving the search more resources (parameters, compute, data) provides a better opportunity to find the simple, generalizable program that corresponds to the true underlying structure, rather than settling for a more complex, memorizing one.

Second, why does generalization depend on the data's structure? This is a natural consequence of a simplicity-biased program search.

When trained on real data, there exists a short, simple program that explains the statistical regularities (for example, cats have pointy ears and whiskers). The simplicity bias of the learning process finds this program, and because it captures the true structure, it generalizes well.

When trained on random labels, no such simple program exists. The only way to map the given images to the random labels is via a long, high-complexity program: effectively, a look-up table. Forced against its inductive bias, the learning algorithm eventually finds such a program to minimize the training loss. This solution is pure memorization and, naturally, fails to generalize.
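As a toy illustration of this complexity gap (a hypothetical sketch, not from the original post; the parity rule and the character-count proxy for description length are invented for this example), compare how compactly a rule-based labeling can be written against the literal size of a look-up table for random labels:

```python
import random

# Toy illustration (hypothetical): labels that follow a rule admit a
# short program, while random labels can only be reproduced exactly
# by a table whose description grows with the dataset.

inputs = list(range(1000))

# Structured labels: a short, simple "program" (parity of the input).
rule_description = "lambda x: x % 2"
structured_label = eval(rule_description)

# Random labels: no rule compresses them, so the exact map is a table.
random.seed(0)
random_table = {x: random.randint(0, 1) for x in inputs}

# Crude proxy for description length: characters needed to write each down.
rule_size = len(rule_description)
table_size = len(repr(random_table))
print(rule_size, table_size)  # the table description is far larger
```

The rule's description stays the same size no matter how many examples there are, while the table grows linearly with the dataset: the gap only widens with more data.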

If one assumes something like the program synthesis hypothesis is true, the phenomenon of data-dependent generalization is not so surprising. A model's ability to generalize is not a fixed property of its architecture, but a property of the program it learns. The model finds a simple program on the real dataset and a complex one on the random dataset, and the two programs have very different generalization properties. There is also some evidence that the mechanism behind generalization is not unrelated to the other empirical phenomena we have discussed.

We can see this in the grokking setting discussed earlier. Recall the transformer trained on modular addition.
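For concreteness, the task can be sketched as follows (a minimal illustration; the modulus p = 113 is a common choice in the grokking literature, not necessarily the exact setup referenced here):

```python
# Minimal sketch of the modular-addition task: the model sees a pair
# (a, b) and must predict (a + b) mod p. The full input space is
# enumerable, and training typically uses a random subset of it.
p = 113  # a small prime, commonly used in grokking experiments

dataset = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
print(len(dataset))  # p * p = 12769 examples in total
```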

Initially, the model learns a memorization-based program. It achieves 100% accuracy on the training data, but its test accuracy is near zero. This is analogous to learning the random-label dataset: a complex, non-generalizing solution.

After extensive further training, driven by a regularizer that penalizes complexity (weight decay), the model's internal solution undergoes a phase transition.
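The role of weight decay can be sketched in isolation (a hypothetical toy, not the actual training setup): even when the training loss gradient is zero, as after the memorization phase, the decay term keeps shrinking the weight norm, pressuring the model toward a lower-complexity solution.

```python
import math
import random

# Toy sketch: gradient descent with weight decay. The decay term
# wd * w acts like an L2 complexity penalty on the parameters.
random.seed(0)
w = [random.gauss(0, 1) for _ in range(8)]  # toy parameter vector
lr, wd = 0.1, 0.01                          # learning rate, weight-decay strength
initial_norm = math.sqrt(sum(x * x for x in w))

def grad_loss(w):
    # Pretend the training loss is already at a minimum (zero gradient):
    # only the decay term continues to act on the weights.
    return [0.0] * len(w)

for _ in range(1000):
    g = grad_loss(w)
    w = [x - lr * (gx + wd * x) for x, gx in zip(w, g)]

final_norm = math.sqrt(sum(x * x for x in w))
print(final_norm / initial_norm)  # shrinks by (1 - lr * wd) ** 1000, about 0.37
```

Because the loss gradient is zero here, each step multiplies every weight by (1 - lr * wd); over many steps this steadily drives the parameters toward smaller-norm, simpler configurations.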