Zach Furman
Meanwhile, even the most efficient versions of Solomonoff induction, like speed induction, run in exponential time or worse.
If deep learning is efficiently performing some version of program synthesis analogous to Solomonoff induction, that means it has implicitly managed to do what we could not figure out how to do explicitly; its efficiency must be due to some insight we do not yet possess.
Of course, we know part of the answer.
SGD only needs local information in order to optimize, instead of the brute-force global search one performs in Bayesian learning.
But then the mystery becomes a well-known one.
Why does myopic search like SGD converge to globally good solutions?
Both of these are questions about the optimization process.
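As a toy illustration of the local-versus-global contrast above (the objective and hyperparameters here are my own illustrative choices, not from the source): gradient descent only ever consults the slope at its current point, while a brute-force search must evaluate the objective over the entire domain.

```python
import numpy as np

# Toy objective: a quadratic with its minimum at w = 3 (chosen arbitrarily).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

# Local search: gradient descent uses only the slope at the current point.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)  # each step needs only local information

# Global search: brute-force evaluation over a grid of candidates.
grid = np.linspace(-10, 10, 10001)
w_global = grid[np.argmin(loss(grid))]
```

Both approaches find the minimum here, but the brute-force search needed 10,001 loss evaluations over the whole domain, while gradient descent needed 100 purely local steps; in high-dimensional parameter spaces, the grid approach becomes hopeless while the local one remains cheap.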
It is not obvious at all how local optimizers like SGD would be able to perform something like Solomonoff induction, let alone far more efficiently than we historically ever figured out for versions of Solomonoff induction itself.
These are difficult questions, but I will attempt to point towards research that I believe can answer them.
A priori, the optimization process can depend on many things: choice of optimizer, regularization, dropout, step size, and so on.
But we can note that deep learning is able to work somewhat successfully, albeit sometimes with degraded performance, across wide ranges of choices of these variables.
It does not seem like the choice of AdamW versus SGD matters nearly as much as the choice to do gradient-based learning in the first place.
In other words, I believe these variables may affect efficiency, but I doubt they are fundamental to the explanation of why the optimization process can possibly succeed.
Instead, there is one common variable here which appears to determine the vast majority of the behavior of stochastic optimizers: the loss function.
Optimizers like SGD take every gradient step according to a minibatch loss function like mean squared error.
$$w_{t+1} = w_t - \eta \, \nabla_w L_{B_t}(w_t)$$
$$L_B(w) = \frac{1}{|B|} \sum_{(x, y) \in B} \left( f_w(x) - y \right)^2$$
where $w_t$ denotes the network parameters at step $t$, $\eta$ is the step size, $B_t$ is the minibatch sampled at step $t$, and $f_w$ is the network.
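A minimal numpy sketch of minibatch SGD on a mean-squared-error loss, as described above. The model, data, and hyperparameters are my own illustrative choices (a one-parameter linear model on synthetic data), not from the source; each step computes the gradient only on a small random batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus small noise (coefficients chosen arbitrarily).
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0            # parameters of the model f(x) = w*x + b
lr, batch_size = 0.1, 16   # step size and minibatch size

for step in range(2000):
    # Sample a minibatch and take a gradient step on its mean squared error.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    w -= lr * np.mean(2 * err * xb)  # d/dw of the minibatch MSE
    b -= lr * np.mean(2 * err)       # d/db of the minibatch MSE
```

After training, `w` and `b` land close to the true values 2 and 1, even though no single step ever saw more than 16 of the 200 examples; this is the sense in which each gradient step depends only on a minibatch loss.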