Zach Furman
The learning process is not looking for just any program that fits the data.
It is looking for the simplest such program.
Giving the search more resources (parameters, compute, data) provides a better opportunity to find the simple, generalizable program that corresponds to the true underlying structure, rather than settling for a more complex, memorizing one.
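A toy sketch of such a simplicity-biased search (entirely illustrative, not a real learning algorithm): enumerate candidate programs from simplest to most complex and return the first one consistent with the training data.

```python
# Toy sketch of a simplicity-biased program search (illustrative only).
# Candidates are ordered from simplest to most complex; the search returns
# the first, i.e. simplest, program consistent with the training data.

def simplest_fit(candidates, data):
    for name, program in candidates:
        if all(program(x) == y for x, y in data):
            return name
    return None  # no candidate fits

# Structured data: a short program (y = 2x) explains every example.
structured = [(x, 2 * x) for x in range(5)]

# "Random labels": no short rule fits; only a lookup table does.
noisy = [(0, 7), (1, 3), (2, 3), (3, 9), (4, 1)]
table = dict(noisy)

candidates = [
    ("identity", lambda x: x),    # simplest
    ("double", lambda x: 2 * x),  # still short
    ("lookup-table", table.get),  # longest: memorizes every pair
]

print(simplest_fit(candidates, structured))  # -> double
print(simplest_fit(candidates, noisy))       # -> lookup-table
```

On the structured data the search stops at the short "double" program; on the random labels it is forced all the way down to the lookup table.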
Second, why does generalization depend on the data's structure?
This is a natural consequence of a simplicity-biased program search.
When trained on real data, there exists a short, simple program that explains the statistical regularities (for example, cats have pointy ears and whiskers).
The simplicity bias of the learning process finds this program, and because it captures the true structure, it generalizes well.
When trained on random labels, no such simple program exists.
The only way to map the given images to the random labels is via a long, high-complexity program, effectively a lookup table.
Forced against its inductive bias, the learning algorithm eventually finds such a program to minimize the training loss.
This solution is pure memorization and, naturally, fails to generalize.
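One way to make the "no simple program exists" claim concrete (a toy measure of my own, not from the text) is to compare description lengths: a rule stays constant-size no matter how much data it explains, while a lookup table for random labels grows linearly with the dataset, because the shortest description of random labels is essentially the data itself.

```python
# Toy illustration: description length of a rule vs. a lookup table.
import random

def rule_description():
    return "lambda x: 2 * x"  # a fixed-size program, independent of data size

def table_description(labels):
    return repr(dict(enumerate(labels)))  # memorization: grows with the data

random.seed(0)
for n in (10, 100, 1000):
    random_labels = [random.randrange(10) for _ in range(n)]
    print(n, len(rule_description()), len(table_description(random_labels)))
```

The rule's length is constant across dataset sizes, while the table's length scales with the number of examples it must memorize.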
If one assumes something like the program synthesis hypothesis is true, the phenomenon of data-dependent generalization is not so surprising.
A model's ability to generalize is not a fixed property of its architecture, but a property of the program it learns.
The model finds a simple program on the real dataset and a complex one on the random dataset, and the two programs have very different generalization properties. There is also some evidence that the mechanism behind generalization is related to the other empirical phenomena we have discussed.
We can see this in the grokking setting discussed earlier.
Recall the transformer trained on modular addition.
Initially, the model learns a memorization-based program.
It achieves 100% accuracy on the training data, but its test accuracy is near zero.
This is analogous to learning the random label dataset, a complex, non-generalizing solution.
After extensive further training, driven by a regularizer that penalizes complexity (weight decay), the model's internal solution undergoes a phase transition.
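The memorization phase of this setup can be sketched with a minimal baseline (dataset construction only; the transformer, optimizer, and weight-decay schedule are omitted, and the modulus and 50% train fraction are illustrative choices, not necessarily those of the experiment discussed). A pure lookup table over the training pairs reproduces exactly the signature described above: perfect training accuracy, near-zero test accuracy.

```python
# Minimal sketch of the modular-addition dataset and a memorization baseline.
import random

p = 97  # prime modulus; a common illustrative choice in grokking experiments
pairs = [(a, b) for a in range(p) for b in range(p)]
random.seed(0)
random.shuffle(pairs)

split = len(pairs) // 2  # illustrative 50/50 train/test split
train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
test = [((a, b), (a + b) % p) for a, b in pairs[split:]]

# A pure lookup table memorizes train perfectly but knows nothing about test.
table = dict(train)
train_acc = sum(table.get(x) == y for x, y in train) / len(train)
test_acc = sum(table.get(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 on train, 0.0 on test
```

The grokking claim is that a simplicity-penalized model eventually abandons this kind of table in favor of a program that actually computes addition mod p, at which point test accuracy jumps.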