Zach Furman
The paradox of generalization.
See also this post for related discussion.
Perhaps the most jarring departure from classical theory comes from how deep learning models generalize.
A learning algorithm is only useful if it can perform well on new, unseen data.
The central question of statistical learning theory is: under what conditions can a model generalize?
The classical answer is the bias-variance trade-off.
The theory posits that a model's error can be decomposed into two main sources.
Bias: error from the model being too simple to capture the underlying structure of the data (underfitting).
Variance: error from the model being too sensitive to the specific training data it saw, causing it to fit noise (overfitting).
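The decomposition above can be checked empirically: holding the data distribution fixed and refitting a model on many freshly sampled training sets, bias is the squared gap between the average prediction and the truth, and variance is the spread of predictions around that average. Here is a minimal numpy sketch using polynomial regression on a sine curve; the specific degrees, noise level, and sample sizes are illustrative choices of mine, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin          # the true function we are trying to learn
sigma = 0.3         # noise standard deviation
n, trials = 30, 300
x_test = np.linspace(0.5, 5.5, 50)

def avg_bias2_var(degree):
    """Monte Carlo estimate of average bias^2 and variance over x_test."""
    preds = np.empty((trials, len(x_test)))
    for t in range(trials):
        # Fresh training set each trial: same distribution, new noise.
        x = rng.uniform(0.0, 6.0, n)
        y = f(x) + rng.normal(0.0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    return bias2, var

b_lo, v_lo = avg_bias2_var(1)    # too simple: high bias, low variance
b_hi, v_hi = avg_bias2_var(10)   # flexible: low bias, high variance
print(f"degree 1:  bias^2={b_lo:.3f}, variance={v_lo:.3f}")
print(f"degree 10: bias^2={b_hi:.3f}, variance={v_hi:.3f}")
```

On this toy problem the degree-1 model shows large bias and small variance, while the degree-10 model shows the reverse, which is exactly the trade-off the classical picture describes.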
According to this framework, learning is a delicate balancing act.
The practitioner's job is to carefully choose a model of the right complexity, neither too simple nor too complex, so that it lands in a Goldilocks zone where both bias and variance are low.
This view is reinforced by principles like the no-free-lunch theorems, which suggest there is no universally good learning algorithm, only algorithms whose inductive biases are carefully chosen by a human to match a specific problem domain.
The clear prediction from this classical perspective is that naively increasing a model's capacity (for example, by adding more parameters) far beyond what is needed to fit the training data is a recipe for disaster.
Such a model should have catastrophically high variance, leading to rampant overfitting and poor generalization.
And yet, perhaps the single most important empirical finding in modern deep learning is that this prediction is completely wrong.
The bitter lesson, as Rich Sutton calls it, is that the most reliable path to better performance is to scale up compute and model size, sometimes far into the regime where the model can easily memorize the entire training set.
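The claim that memorizing the training set need not destroy generalization is easy to see in miniature with a 1-nearest-neighbor classifier, which by construction interpolates its training data (zero training error) yet can still predict well on new points. This toy setup is my own illustration of that phenomenon, not the deep learning setting the post describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Two Gaussian classes with well-separated means in 2D."""
    X0 = rng.normal(-1.5, 1.0, size=(n, 2))
    X1 = rng.normal(+1.5, 1.0, size=(n, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

def one_nn_predict(X_train, y_train, X):
    """Label each query point with its nearest training point's label."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d.argmin(axis=1)]

X_tr, y_tr = make_blobs(100)
X_te, y_te = make_blobs(500)

# Every training point's nearest neighbor is itself, so the model
# "memorizes" the training set perfectly...
train_acc = (one_nn_predict(X_tr, y_tr, X_tr) == y_tr).mean()
# ...yet it still classifies held-out points accurately.
test_acc = (one_nn_predict(X_tr, y_tr, X_te) == y_te).mean()
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Interpolation alone does not doom a model; what matters is how the model behaves between the memorized points, which is where the classical variance argument breaks down for overparameterized networks.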