Zach Furman
This goes beyond a minor deviation from theoretical predictions.
It is a direct contradiction of the theory's core prescriptive advice.
This brings us to a second, deeper puzzle, first highlighted by Zhang et al. (2017). The authors conduct a simple experiment.
They train a standard vision model on a real dataset, for example, CIFAR-10, and confirm that it generalizes well.
They then train the exact same model, with the exact same architecture, optimizer, and regularization, on a corrupted version of the dataset where the labels have been completely randomized.
The network is expressive enough to achieve near-zero training error on the randomized labels, perfectly memorizing the nonsensical data.
As expected, its performance on a test set is terrible.
It has learned nothing generalizable.
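The core of the experiment is the label-randomization step. Here is a minimal sketch of what that corruption looks like; the function name and the toy dataset are illustrative stand-ins, not code from the paper:

```python
import random

def randomize_labels(dataset, num_classes, seed=0):
    """Return a copy of the (input, label) pairs where every label is
    replaced by one drawn uniformly at random, destroying any
    relationship between inputs and labels."""
    rng = random.Random(seed)
    return [(x, rng.randrange(num_classes)) for x, _ in dataset]

# Toy stand-in for a labeled dataset like CIFAR-10 (10 classes).
clean = [(f"img_{i}", i % 10) for i in range(8)]
corrupted = randomize_labels(clean, num_classes=10)

# The inputs are untouched; only the labels are scrambled.
assert [x for x, _ in corrupted] == [x for x, _ in clean]
```

Training the same architecture on `clean` versus `corrupted` is what isolates the puzzle: the model, optimizer, and regularization are identical, and only the data's structure differs.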
The paradox is this: why did the exact same model generalize well on the real data?
Classical theories often tie a model's generalization ability to its capacity for complexity, which is a fixed property of its architecture related to its expressivity.
But this experiment shows that generalization is not a static property of the model.
It is a dynamic outcome of the interaction between the model, the learning algorithm, and the structure of the data itself.
The very same network that is completely capable of memorizing random noise somehow chooses to find a generalizable solution when trained on data with real structure.
Why?
The program synthesis hypothesis offers a coherent explanation for both of these paradoxes.
First, why does scaling work?
The hypothesis posits that learning is a search through some space of programs, guided by a strong simplicity bias.
In this view, adding more parameters is analogous to expanding the search space, for example, allowing for longer or more complex programs.
While this does increase the model's capacity to represent overfitting solutions, the simplicity bias acts as a powerful regularizer.