Zach Furman
The first is the phenomenon of degeneracies.
Consider, for instance, dead neurons, whose incoming weights and activations are such that the neuron never fires for any input.
A neural network with dead neurons acts like a smaller network with those dead neurons removed.
This gives neural networks a mechanism for changing their effective size in a parameter-dependent way, which is required in order to, for example, dynamically add or remove a subroutine depending on where you are in parameter space, as in our example above.
In fact, dead neurons are just one example in a whole zoo of degeneracies with similar effects, which seem incredibly pervasive in neural networks.
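A minimal sketch of the dead-neuron case makes the mechanism concrete. The network below is a toy two-layer ReLU network of my own construction (not from the original text); giving one hidden neuron a large negative bias kills it for any input of modest size, and the full network then computes exactly the same function as a smaller network with that neuron deleted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy 2-layer ReLU network: 3 inputs -> 4 hidden units -> 1 output.
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(1, 4))

# "Kill" hidden neuron 2: with a large negative bias its pre-activation
# is negative for any input of modest norm, so it never fires.
b1[2] = -100.0

def full_net(x):
    return W2 @ relu(W1 @ x + b1)

# The same function with the dead neuron removed outright.
keep = [0, 1, 3]
def smaller_net(x):
    return W2[:, keep] @ relu(W1[keep] @ x + b1[keep])

# The two networks agree on a batch of random inputs.
xs = rng.normal(size=(100, 3))
max_gap = max(abs(full_net(x) - smaller_net(x)).max() for x in xs)
```

The point is that an entire region of parameter space (anything keeping that pre-activation negative) implements the smaller network, which is what makes this a degeneracy rather than an isolated coincidence.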
It is worth mentioning that this picture is now highly suggestive of a specific branch of math known as algebraic geometry. Algebraic geometry, and in particular singularity theory, systematically studies these degeneracies, and further provides a bridge between discrete structure (algebra) and continuous structure (geometry): exactly the type of connection we identified as necessary for the program synthesis hypothesis.
Furthermore, singular learning theory tells us how these degeneracies control the loss landscape and the learning process, though classically only in the Bayesian setting, a limitation we discuss in the next section.
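To gesture at where this control shows up, the central asymptotic of singular learning theory (Watanabe's free energy formula) can be stated as follows; the notation here (\(F_n\), \(L_n\), \(\lambda\)) is a standard presentation from the SLT literature rather than anything specific to this text.

```latex
% Bayes free energy of a model class, in the Bayesian setting:
%   F_n       -- free energy after n samples
%   L_n(w_0)  -- empirical loss at an optimal parameter w_0
%   \lambda   -- real log canonical threshold (RLCT), an
%                algebro-geometric measure of degeneracy at w_0
F_n = n\,L_n(w_0) + \lambda \log n + O_p(\log \log n)
```

A smaller \(\lambda\) (a more degenerate optimum) means a smaller penalty term, so degeneracies directly shape which regions of parameter space the Bayesian posterior prefers.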
There is much more that can be said here, but I leave it for the future to treat this material properly.
The search problem
There's another problem with this story.
Our hypothesis is that deep learning is performing some version of program synthesis.
That means that we not only have to explain how programs get represented in neural networks, we also need to explain how they get learned.
There are two subproblems here.
First, how can deep learning even implement the needed inductive biases?
For deep learning algorithms to be implementing something analogous to Solomonoff induction, they must be able to implicitly follow inductive biases which depend on the program structure, like simplicity bias.
That is, the optimization process must somehow be aware of the program structure in order to favor some types of programs, for example shorter programs, over others: the optimizer must see the program structure through the parameters.
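As a toy illustration of the kind of simplicity bias meant here (a Solomonoff-style prior over programs, not a claim about what gradient descent actually does), one can weight each program by \(2^{-\text{length}}\) and condition on the data. All names and programs below are invented for the example.

```python
# Three toy "programs" (Python expressions), two short and one long,
# all computing the same function x -> 2x.
programs = {
    "x*2":       lambda x: x * 2,
    "x+x":       lambda x: x + x,
    "x*3-x+0*x": lambda x: x * 3 - x + 0 * x,
}

data = [(1, 2), (2, 4), (5, 10)]

def prior(src):
    # Solomonoff-style prior: weight 2^(-description length).
    return 2.0 ** (-len(src))

# Keep only programs consistent with the data, then normalize.
posterior = {src: prior(src)
             for src, f in programs.items()
             if all(f(x) == y for x, y in data)}
total = sum(posterior.values())
posterior = {src: w / total for src, w in posterior.items()}
```

All three programs fit the data perfectly, yet the two length-3 programs end up with far more posterior mass than the length-9 one. An optimizer with an analogous bias would need some way of reading "description length" off the parameters themselves, which is exactly the difficulty raised above.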
Second, deep learning works in practice, using only a reasonable amount of computational resources; any account of it as program synthesis must explain how the search can be that efficient.