Zach Furman
You don't do something stupid like condition them badly numerically. And they wanna learn. They'll do it. Dario Amodei. End quote.
I remember when I trained my first neural network, there was something almost miraculous about it.
It could solve problems that I had absolutely no idea how to code myself, such as distinguishing a cat from a dog, and it did so in a completely opaque way: even after it had solved the problem, I had no better picture of how to solve it myself than I did beforehand.
Moreover, it was remarkably resilient: despite obvious problems with the optimizer, bugs in the code, or bad training data, it still worked, unlike any other engineered system I had ever built, almost reminiscent of something biological in its robustness.
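The experience described above is easy to recreate in miniature. The sketch below, in plain NumPy, trains a tiny two-layer network on a toy two-class problem; the data, architecture, and hyperparameters are my own illustrative choices, not anything from the text. The point is the same one made above: nowhere do we write a classification rule, yet the network finds one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs (stand-ins for "cat" and "dog").
n = 200
X = np.vstack([rng.normal(-2.0, 1.0, (n // 2, 2)),
               rng.normal(2.0, 1.0, (n // 2, 2))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

# Tiny MLP: 2 inputs -> 8 tanh units -> 1 sigmoid output.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

lr = 0.1
for step in range(1000):
    H = np.tanh(X @ W1 + b1)                         # hidden activations
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2))).ravel() # predicted P(class 1)

    # Gradients of mean cross-entropy loss, by backpropagation.
    d = (p - y)[:, None] / n
    gW2 = H.T @ d;  gb2 = d.sum(0)
    dH = (d @ W2.T) * (1 - H**2)
    gW1 = X.T @ dH; gb1 = dH.sum(0)

    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Final predictions and training accuracy.
H = np.tanh(X @ W1 + b1)
p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2))).ravel()
acc = float(((p > 0.5) == y).mean())
```

Note that the learned rule lives entirely in the weight matrices: inspecting `W1` and `W2` after training gives no more insight into "how to tell the classes apart" than one had before, which is exactly the opacity described above.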
My impression is that this sense of magic is a common, if often unspoken, experience among practitioners.
Many simply learn to accept the mystery and get on with the work.
But there is nothing virtuous about confusion; it just suggests that your understanding is incomplete, that you are ignorant of the real mechanisms underlying the phenomenon.
Our practical success with deep learning has outpaced our theoretical understanding.
This has led to a proliferation of explanations that often feel ad hoc and local, tailor-made to account for a specific empirical finding without connecting to other observations or any larger framework.
For instance, the theory of double descent provides a narrative for why test loss can fall a second time as model capacity grows past the interpolation threshold, defying the classical U-shaped curve, but it is a self-contained story.
It does not, for example, share a conceptual foundation with the theories we have for how induction heads form in transformers.
Each new discovery seems to require a new, bespoke theory.
One naturally worries that we are juggling epicycles.
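Double descent itself is easy to reproduce in miniature. The sketch below is my own illustrative setup, not anything from the text: minimum-norm least squares on random ReLU features, with the number of features swept past the number of training points. Train error typically hits zero once the width reaches the sample count, while test error tends to spike near that interpolation threshold before descending again.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(X, W):
    """Random ReLU feature map: max(XW, 0)."""
    return np.maximum(X @ W, 0.0)

# Hypothetical smooth target function for this illustration.
def target(X):
    return np.sin(X.sum(axis=1))

n_train, n_test, d = 40, 200, 5
Xtr = rng.normal(size=(n_train, d)); ytr = target(Xtr)
Xte = rng.normal(size=(n_test, d));  yte = target(Xte)

# Sweep feature counts from under- to over-parameterized.
widths = [5, 10, 20, 40, 80, 160, 320]
train_err, test_err = [], []
for width in widths:
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    Ftr = relu_features(Xtr, W)
    Fte = relu_features(Xte, W)
    # lstsq returns the minimum-norm solution when width >= n_train.
    beta, *_ = np.linalg.lstsq(Ftr, ytr, rcond=None)
    train_err.append(float(np.mean((Ftr @ beta - ytr) ** 2)))
    test_err.append(float(np.mean((Fte @ beta - yte) ** 2)))
```

Plotting `test_err` against `widths` usually shows the characteristic peak at `width == n_train`: exactly the kind of isolated empirical curve, with its own bespoke explanation, that the passage above is worried about.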
This sense of theoretical fragility is compounded by a second problem.
For any single one of these phenomena, we often lack consensus, instead entertaining multiple competing hypotheses.
Consider the core question of why neural networks generalize.