Zach Furman
Is it best explained by the implicit bias of SGD towards flat minima, the behavior of neural tangent kernels, or some other property?
The field actively debates these views.
And where no mechanistic theory has gained traction, we often retreat to descriptive labels.
We say complex abilities are an emergent property of scale, a term that names the mystery without explaining its cause.
This theoretical disarray is sharpest when we examine our most foundational frameworks.
Here, the issue is not just a lack of consensus but a direct conflict with empirical reality.
This disconnect manifests in several ways.
Sometimes, our theories make predictions that are actively falsified by practice.
Classical statistical learning theory, with its focus on the bias-variance trade-off, advises against the very scaling strategies that have produced almost all state-of-the-art performance.
In other cases, a theory might be technically true but practically misleading, failing to explain the key properties that make our models effective.
The universal approximation theorem, for example, guarantees representational power, but its standard constructions require a number of units that grows exponentially with the input dimension, a cost our models somehow avoid.
And in yet other areas, our classical theories are almost entirely silent.
They offer no framework to even begin explaining deep puzzles like the uncanny convergence of representations across vastly different models trained on the same data.
We are therefore faced with a collection of major empirical findings where our foundational theories are either contradicted, misleading, or simply absent.
This theoretical vacuum creates an opportunity for a new perspective.
The program synthesis hypothesis offers such a perspective.
It suggests we shift our view of what deep learning is fundamentally doing from statistical function fitting to program search.
The specific claim is that deep learning performs a search for simple programs that explain the data.
This shift in viewpoint may offer a way to make sense of the theoretical tensions we have outlined.
If the learning process is a search for an efficient program rather than an arbitrary function, then the circumvention of the curse of dimensionality is no longer so mysterious.
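The "search for simple programs" framing can be illustrated with a deliberately toy sketch, not from the text: enumerate arithmetic expressions in order of size and return the first one consistent with the data, a brute-force stand-in for the minimum-description-length idea. (Deep learning's actual search is gradient-based, not enumerative; the grammar and data below are invented for illustration.)

```python
import itertools

# Data generated by the hidden "program" x*x + 1.
data = [(0, 1), (1, 2), (2, 5), (3, 10)]

LEAVES = ["x", "1"]
OPS = ["+", "*"]

def programs(size):
    # All fully parenthesized expressions with `size` leaves.
    if size == 1:
        yield from LEAVES
        return
    for left in range(1, size):
        for a, b, op in itertools.product(
            programs(left), programs(size - left), OPS
        ):
            yield f"({a}{op}{b})"

def explains(prog):
    # A program "explains the data" if it reproduces every observation.
    return all(eval(prog, {"x": x}) == y for x, y in data)

def simplest_program(max_size=4):
    # Enumerate by size, so the first consistent program is minimal.
    for size in range(1, max_size + 1):
        for prog in programs(size):
            if explains(prog):
                return prog
    return None
```

A three-leaf program suffices here, and the search finds one; the hypothesis is that something functionally similar, a bias toward short, efficient programs, is what lets learned models sidestep the exponential cost that arbitrary function fitting would incur.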