Zach Furman
One way to account for this is to hypothesize that the models are not navigating some undifferentiated space of arbitrary functions, but are instead homing in on a sparse set of highly effective programs that solve the task.
If, following the physical Church-Turing thesis, we view the natural world as having a true, computable structure, then an effective learning process can be seen as a search for an algorithm that approximates that structure.
In this light, convergence is not an accident, but a sign that different search processes are discovering similar objectively good solutions, much as different engineering traditions might independently arrive at the arch as an efficient solution for bridging a gap.
This hypothesis, that learning is a search for an optimal, objective program, carries with it a strong implication.
The search process must be a general-purpose one, capable of finding such programs without them being explicitly encoded in its architecture.
As it happens, an independent, large-scale trend in the field provides a great deal of data on this very point.
Rich Sutton's Bitter Lesson describes the consistent empirical finding that long-term progress comes from scaling general learning methods rather than from encoding specific human domain knowledge.
The old paradigm, particularly in fields like computer vision, speech recognition, and game playing, involved painstakingly hand-crafting systems with significant prior knowledge.
For years, the state of the art relied on complex, hand-designed feature extractors like SIFT and HOG, which were built on human intuitions about what aspects of an image are important.
The role of learning was confined to a relatively simple classifier that operated on these predigested features.
The underlying assumption was that the search space was too difficult to navigate without strong human guidance.
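To make the old paradigm concrete, here is a toy sketch of the kind of hand-designed feature extraction described above: a histogram of gradient orientations in the spirit of HOG. This is an illustrative simplification, not real HOG, which adds cells, overlapping blocks, and contrast normalization; every human design decision here (gradient operator, number of bins, unsigned orientations) is exactly the kind of encoded prior knowledge the text refers to.

```python
import math

def orientation_histogram(image, n_bins=9):
    """Toy HOG-style feature: a histogram of gradient orientations,
    weighted by gradient magnitude. The learning system downstream
    would see only this fixed, human-designed summary of the image."""
    h, w = len(image), len(image[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference gradients (a hand-picked design choice).
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180), another design choice.
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * n_bins) % n_bins] += mag
    return hist

# A vertical edge: dark columns on the left, bright on the right.
img = [[0, 0, 10, 10]] * 4
feat = orientation_histogram(img)
```

All of the gradient mass lands in the horizontal-orientation bin, as the designer intended; the "learning" part of such a system would be only a simple classifier consuming vectors like `feat`.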
The modern paradigm of deep learning has shown this assumption to be incorrect.
Progress has come from abandoning these handcrafted constraints in favor of training general, end-to-end architectures with the brute force of data and compute.
This consistent triumph of general learning over encoded human knowledge is a powerful indicator that the search process we are using is, in fact, general purpose.
It suggests that the learning algorithm itself, when given a sufficiently flexible substrate and enough resources, is a more effective mechanism for discovering relevant features and structure than human ingenuity.
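By contrast, the end-to-end approach hands the raw input directly to a learned model. As a minimal sketch (modern systems stack many learned layers; a single linear layer keeps the contrast with the hand-crafted pipeline as simple as possible), here is a logistic-regression model trained by gradient descent directly on raw pixels, with no feature extractor in between. The toy task and all names are illustrative assumptions.

```python
import math

def train_logreg(samples, labels, lr=0.5, epochs=200):
    """End-to-end sketch: a linear model learns directly from raw
    pixels via SGD on the log loss, with no hand-designed features."""
    n = len(samples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy task: classify 2x2 images, flattened to raw pixel vectors.
data = [[0, 0, 0, 1], [1, 1, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1]]
labels = [0, 1, 0, 1]
w, b = train_logreg(data, labels)
```

The point of the contrast is where the structure comes from: here the weights that decide which pixels matter are discovered by the optimizer, not specified by a human designer.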
This perspective helps connect these phenomena, but it also invites us to refine our initial picture.
First, the notion of a single optimal program may be too rigid.
It is possible that what we are observing is not convergence to a single point, but to a narrow subset of similarly structured, highly efficient programs.
The models may be learning different but algorithmically related solutions, all belonging to the same family of effective strategies.
Second, it is unclear whether this convergence is purely a property of the problem's solution space, or whether it is also a consequence of the inductive biases of our search algorithm.