Zach Furman
Meanwhile, even the most efficient versions of Solomonoff induction, like speed induction, run in exponential time or worse.
If deep learning is efficiently performing some version of program synthesis analogous to Solomonoff induction, that means it has implicitly managed to do what we could not figure out how to do explicitly; its efficiency must be due to some insight we do not yet possess.
Of course, we know part of the answer.
SGD only needs local information in order to optimize, instead of the brute-force global search one performs in Bayesian learning.
But then the mystery becomes a well-known one.
Why does myopic search like SGD converge to globally good solutions?
Both of these are questions about the optimization process.
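As a toy illustration of the local-versus-global contrast above (the objective and hyperparameters here are my own illustrative choices, not from the source): gradient descent only ever consults the slope at its current point, while a brute-force search must evaluate the objective over the entire domain.

```python
import numpy as np

# Toy objective: a quadratic with its minimum at w = 3 (chosen arbitrarily).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

# Local search: gradient descent uses only the slope at the current point.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)  # each step needs only local information

# Global search: brute-force evaluation over a grid of candidates.
grid = np.linspace(-10, 10, 10001)
w_global = grid[np.argmin(loss(grid))]
```

Both approaches find the minimum here, but the brute-force search needed 10,001 loss evaluations over the whole domain, while gradient descent needed 100 purely local steps; in high-dimensional parameter spaces, the grid approach becomes hopeless while the local one remains cheap.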
It is not obvious at all how local optimizers like SGD would be able to perform something like Solomonoff induction, let alone far more efficiently than we historically ever figured out for versions of Solomonoff induction itself.
These are difficult questions, but I will attempt to point towards research that I believe can answer them.
A priori, the optimization process can depend on many things: choice of optimizer, regularization, dropout, step size, and so on.
But we can note that deep learning is able to work somewhat successfully, albeit sometimes with degraded performance, across wide ranges of choices of these variables.
It does not seem like the choice of AdamW versus SGD matters nearly as much as the choice to do gradient-based learning in the first place.
In other words, I believe these variables may affect efficiency, but I doubt they are fundamental to the explanation of why the optimization process can possibly succeed.
Instead, there is one common variable here which appears to determine the vast majority of the behavior of stochastic optimizers: the loss function.
Optimizers like SGD take every gradient step according to a minibatch loss function like mean squared error.
$$w_{t+1} = w_t - \eta \, \nabla_w L_{B_t}(w_t)$$
$$L_B(w) = \frac{1}{|B|} \sum_{(x, y) \in B} \left( f_w(x) - y \right)^2$$
where $w_t$ denotes the network parameters at step $t$, $\eta$ is the step size, $B_t$ is the minibatch sampled at step $t$, and $f_w$ is the network.
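A minimal numpy sketch of minibatch SGD on a mean-squared-error loss, as described above. The model, data, and hyperparameters are my own illustrative choices (a one-parameter linear model on synthetic data), not from the source; each step computes the gradient only on a small random batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus small noise (coefficients chosen arbitrarily).
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0            # parameters of the model f(x) = w*x + b
lr, batch_size = 0.1, 16   # step size and minibatch size

for step in range(2000):
    # Sample a minibatch and take a gradient step on its mean squared error.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    w -= lr * np.mean(2 * err * xb)  # d/dw of the minibatch MSE
    b -= lr * np.mean(2 * err)       # d/db of the minibatch MSE
```

After training, `w` and `b` land close to the true values 2 and 1, even though no single step ever saw more than 16 of the 200 examples; this is the sense in which each gradient step depends only on a minibatch loss.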