The network discovers a Fourier-based algorithm for modular addition.
Coincident with the discovery of this algorithmic program (or rather, with the removal of the memorization program, which occurs slightly later), test accuracy abruptly jumps to 100%.
The sudden increase in generalization appears to be the direct consequence of the model replacing a complex, overfitting solution with a simpler, algorithmic one.
In this instance, generalization is achieved through the synthesis of a different, more efficient program.
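To make the discovered algorithm concrete, here is a minimal numpy sketch of the "clock" computation a grokked network implements: embed each input as points on circles of several frequencies, combine them with angle-addition identities, and read off the answer. The modulus and the particular frequencies are illustrative; a trained network selects its own handful.

```python
import numpy as np

def fourier_mod_add(a, b, p=113, freqs=(1, 2, 5)):
    """Hand-coded sketch of the Fourier ('clock') algorithm for (a + b) mod p.

    Each input is embedded as cos/sin waves at several frequencies;
    angle-addition identities combine them, and the logit for each
    candidate answer c measures how well c matches the combined rotation.
    """
    c = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        theta_a = 2 * np.pi * k * a / p
        theta_b = 2 * np.pi * k * b / p
        # angle addition: cos/sin of (theta_a + theta_b)
        cos_ab = np.cos(theta_a) * np.cos(theta_b) - np.sin(theta_a) * np.sin(theta_b)
        sin_ab = np.sin(theta_a) * np.cos(theta_b) + np.cos(theta_a) * np.sin(theta_b)
        # logit for c is cos(theta_a + theta_b - theta_c),
        # maximized exactly when c == (a + b) mod p
        theta_c = 2 * np.pi * k * c / p
        logits += cos_ab * np.cos(theta_c) + sin_ab * np.sin(theta_c)
    return int(np.argmax(logits))

assert fourier_mod_add(57, 86) == (57 + 86) % 113
```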
The Paradox of Convergence
When we ask a neural network to solve a task, we specify what task we'd like it to solve, but not how it should solve it; the purpose of learning is for the model to find strategies on its own.
We define a loss function and an architecture, creating a space of possible functions, and ask the learning algorithm to find a good one by minimizing the loss.
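As a minimal sketch of this division of labor, the PyTorch snippet below fixes an architecture (a space of functions) and a loss (a preference over that space) and leaves the strategy entirely to gradient descent; the toy task, model sizes, and hyperparameters are all illustrative stand-ins.

```python
import torch
import torch.nn as nn

# The architecture defines a family of functions (all weight settings
# of this MLP); the loss says which members count as "good"; the
# optimizer searches. Nothing here says *how* to map x to y.
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(256, 2)      # toy inputs
y = x[:, :1] * x[:, 1:]      # toy target: a product the net must discover

for step in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```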
Given this freedom and the high dimensionality of the search space, one might expect the solutions found by different models, especially those with different architectures or random initializations, to be highly diverse.
Instead, what we observe empirically is a strong tendency towards convergence.
This is most directly visible in the phenomenon of representational alignment, in which independently trained models arrive at internal representations that closely mirror one another.
This alignment is remarkably robust.
It holds across different training runs of the same architecture, showing that the final solution is not a sensitive accident of the random seed.
More surprisingly, it holds across different architectures.
The internal activations of a transformer and a CNN trained on the same vision task, for example, can often be mapped to one another with a simple linear transformation, suggesting they are learning not just similar input-output behavior, but similar intermediate computational steps.
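One way to probe this claim is sketched below, assuming we have already collected activations from both models on the same stimuli: fit a regularized linear map from one representation to the other and score it on held-out inputs. The function and its parameters are illustrative; published analyses typically use more careful metrics, such as CKA.

```python
import numpy as np

def linear_alignment_score(acts_a, acts_b, reg=1e-3):
    """Fit a ridge-regularized linear map from model A's activations to
    model B's on half the stimuli, then report held-out R^2.

    acts_a, acts_b: (n_stimuli, d_a) and (n_stimuli, d_b) arrays of
    activations for the *same* inputs. A high score means the two
    representations differ by little more than a change of basis.
    """
    n = acts_a.shape[0]
    split = n // 2
    A_tr, A_te = acts_a[:split], acts_a[split:]
    B_tr, B_te = acts_b[:split], acts_b[split:]
    # W = argmin ||A W - B||^2 + reg * ||W||^2
    d = A_tr.shape[1]
    W = np.linalg.solve(A_tr.T @ A_tr + reg * np.eye(d), A_tr.T @ B_tr)
    resid = B_te - A_te @ W
    return 1.0 - (resid ** 2).sum() / ((B_te - B_te.mean(0)) ** 2).sum()
```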
It even holds in some cases across modalities.
Models like CLIP, trained to associate images with text, learn a shared representation space where the vector for a photograph of a dog is close to the vector for the phrase "a photo of a dog", indicating convergence on a common, abstract conceptual structure.
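A sketch of what checking this looks like in practice, using the Hugging Face transformers API for CLIP; the checkpoint name and the image path are placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder path
inputs = processor(text=["a photo of a dog", "a photo of a cat"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])

# cosine similarity in the shared space: the dog photo should land
# closer to "a photo of a dog" than to "a photo of a cat"
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
print(img @ txt.T)
```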
The mystery deepens when we observe parallels to biological systems.
The Gabor-like filters that emerge in the early layers of vision networks, for instance, are strikingly similar to the receptive fields of neurons in the V1 area of the primate visual cortex.
It appears that evolution and stochastic gradient descent, two very different optimization processes operating on very different substrates, have converged on similar solutions when exposed to the same statistical structure of the natural world.
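For concreteness, the snippet below constructs the classical Gabor function, a sinusoidal grating under a Gaussian envelope, which is the shape both V1 simple cells and first-layer convolutional filters tend to resemble; all parameter values are illustrative.

```python
import numpy as np

def gabor_kernel(size=11, sigma=2.5, theta=0.0, lam=6.0, psi=0.0, gamma=0.5):
    """Classical Gabor filter: a cosine grating (wavelength lam, phase psi,
    orientation theta) modulated by a Gaussian envelope (width sigma,
    aspect ratio gamma)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by the orientation theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier
```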