Zach Furman
And if so, what does that tell us about what deep learning is actually doing?
It's worth noting what was and wasn't in the training data.
The data contained input-output pairs: 32 and 41 give 73, and so on.
It contained nothing about how to compute them.
The network arrived at a method on its own.
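For concreteness, here is a sketch of that trigonometric algorithm as it is usually described in the grokking literature: each residue becomes an angle on a circle, the angle-addition identities combine the two inputs, and the answer is whichever residue's angle best matches the sum. The modulus and frequency below are illustrative choices, not taken from the transcript.

```python
import numpy as np

P = 97  # modulus; grokking experiments typically use a small prime like this

# Training data: every (a, b) -> (a + b) mod P pair,
# with nothing about *how* to compute the answer.
pairs = [((a, b), (a + b) % P) for a in range(P) for b in range(P)]

def trig_add(a, b, freq=1):
    """Modular addition via angles, the 'trigonometric algorithm'."""
    w = 2 * np.pi * freq / P
    # cos/sin of the summed angle, built from the angle-addition identities
    c = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    s = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    # score each candidate answer: this equals cos(w * (a + b - candidate)),
    # which peaks exactly when candidate == (a + b) mod P
    candidates = np.arange(P)
    scores = c * np.cos(w * candidates) + s * np.sin(w * candidates)
    return int(np.argmax(scores))

trig_add(32, 41)  # → 73, matching the example pair above
```

The procedure never stores any individual pair; it recovers every answer from the circle geometry alone, which is why it generalizes to held-out pairs.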
And both solutions, the lookup table and the trigonometric algorithm, fit the training data equally well.
The network's loss was already near minimal during the memorization phase.
Whatever caused it to keep searching, to eventually settle on the generalizing algorithm instead, it wasn't that the generalizing algorithm fit the data better.
It was something else, some property of the learning process that favored one kind of solution over another.
The generalizing algorithm is, in a sense, simpler.
It compresses what would otherwise be thousands of stored associations into a compact procedure.
Whether that's the right way to think about what happened here, whether simplicity is really what the training process favors, is not obvious.
But something made the network prefer a mechanistic solution that generalized over one that didn't, and it wasn't the training data alone.
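To make the compression claim concrete, compare the two solutions side by side for modular addition (the modulus here is an illustrative choice): the memorizing solution stores one answer per input pair, while the generalizing one is a single short rule covering all of them.

```python
P = 97

# Memorizing solution: one stored association per input pair
lookup = {(a, b): (a + b) % P for a in range(P) for b in range(P)}
len(lookup)  # → 9409 stored entries

# Generalizing solution: a compact procedure, no table at all
def add_mod(a, b):
    return (a + b) % P

# Both fit the training data equally well
all(add_mod(a, b) == y for (a, b), y in lookup.items())  # → True
```

Nothing in the training loss distinguishes the two; the difference shows up only in description length and in behavior on inputs the table never saw.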
Vision circuits.
Grokking is a controlled setting, a small network, a simple task, designed to be fully interpretable.
Does the same kind of structure appear in realistic models solving realistic problems?
Olah et al. (2020) studied InceptionV1, an image classification network trained on ImageNet, a dataset of over a million photographs labeled with object categories.
The network takes in an image and outputs a probability distribution over a thousand possible labels: car, dog, coffee mug, and so on.
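That final step, turning the network's raw per-class scores into a probability distribution, can be sketched in a few lines. The logits below are random placeholders; only the class count (1,000, ImageNet's label set) comes from the text.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to a probability distribution."""
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Placeholder scores standing in for a classifier's final-layer output
rng = np.random.default_rng(0)
logits = rng.normal(size=1000)

probs = softmax(logits)          # one probability per label
top = int(np.argmax(probs))      # index of the predicted class
```

Every entry of `probs` is positive and the entries sum to one, which is what lets the output be read as the network's confidence across the thousand labels.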
Can we understand this more realistic setting?