Zach Furman
And stored associations don't help you with new inputs.
Here's what's unexpected.
If you keep training, despite the training loss already being nearly as low as it can go, the network eventually starts getting the held-out pairs right too.
Not gradually, either.
Test performance jumps from chance to near-perfect over only a few thousand training steps.
So something has changed inside the network.
But what?
It was already fitting the training data.
The data didn't change.
There's no external signal that could have triggered the shift.
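The setup being described can be sketched in code. This is my own illustrative reconstruction, not the exact experiment: a tiny one-hidden-layer network trained with full-batch gradient descent and weight decay on modular addition, with a fraction of the (a, b) pairs held out. The modulus, architecture, and hyperparameters here are assumptions chosen for speed; real grokking runs typically train for far more steps than shown.

```python
import numpy as np

# Sketch of the experiment: modular addition with held-out pairs.
# All hyperparameters here are illustrative assumptions.
rng = np.random.default_rng(0)
p = 23                                    # small modulus for speed
pairs = [(a, b) for a in range(p) for b in range(p)]
rng.shuffle(pairs)
split = int(0.6 * len(pairs))             # 60% train, 40% held out
train, test = pairs[:split], pairs[split:]

def one_hot(a, b):
    x = np.zeros(2 * p)
    x[a] = 1.0
    x[p + b] = 1.0
    return x

X_tr = np.array([one_hot(a, b) for a, b in train])
y_tr = np.array([(a + b) % p for a, b in train])
X_te = np.array([one_hot(a, b) for a, b in test])
y_te = np.array([(a + b) % p for a, b in test])

# One hidden layer, full-batch gradient descent with weight decay.
h = 64
W1 = rng.normal(0, 0.1, (2 * p, h))
W2 = rng.normal(0, 0.1, (h, p))
lr, wd = 0.5, 1e-3

def forward(X):
    H = np.maximum(X @ W1, 0)             # ReLU hidden layer
    Z = H @ W2
    Z = Z - Z.max(axis=1, keepdims=True)  # stable softmax
    P = np.exp(Z); P /= P.sum(axis=1, keepdims=True)
    return H, P

for step in range(2000):                  # far fewer steps than a real run
    H, P = forward(X_tr)
    G = P.copy(); G[np.arange(len(y_tr)), y_tr] -= 1
    G /= len(y_tr)
    dW2 = H.T @ G + wd * W2
    dH = G @ W2.T; dH[H <= 0] = 0
    dW1 = X_tr.T @ dH + wd * W1
    W1 -= lr * dW1; W2 -= lr * dW2

acc_tr = (forward(X_tr)[1].argmax(1) == y_tr).mean()
acc_te = (forward(X_te)[1].argmax(1) == y_te).mean()
print(f"train acc {acc_tr:.2f}, test acc {acc_te:.2f}")
```

Monitoring `acc_te` over a much longer run is what would reveal the delayed jump from chance to near-perfect.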
One way to investigate is to look at the weights themselves.
We can do this at multiple checkpoints over training and ask whether something changes in the weights around the time generalization begins.
It does.
The weights early in training, during the memorization phase, don't have much structure when you analyze them.
Later, they do.
Specifically, if we look at the embedding matrix, we find that it's mapping numbers to particular locations on a circle.
The number zero maps to one position, one maps to a position slightly rotated from that, and so on, wrapping around.
More precisely, the embedding of each number contains sine and cosine values at a small set of specific frequencies.
This structure is absent early in training.
It emerges as training continues, and it emerges around the same time that generalization begins.
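One way to see this kind of circular structure is to Fourier-transform each embedding dimension along the token axis: if the embedding of each number is built from sinusoids at a few frequencies, the spectral power concentrates at exactly those frequencies. The sketch below uses a synthetic embedding matrix with assumed frequencies standing in for a trained checkpoint; in a real analysis you would load the learned embedding instead.

```python
import numpy as np

# Hypothetical stand-in for a trained embedding matrix: each column
# is a sinusoid at one of a few assumed "key frequencies", mimicking
# the circular structure described above.
p = 113                       # modulus (number of tokens); assumed
freqs = [3, 17, 41]           # assumed key frequencies
n = np.arange(p)
cols = []
for f in freqs:
    cols.append(np.cos(2 * np.pi * f * n / p))
    cols.append(np.sin(2 * np.pi * f * n / p))
E = np.stack(cols, axis=1)    # shape (p, 6): embedding of each number

# FFT each embedding dimension along the token axis and sum power
# across dimensions. A circular embedding shows a few sharp peaks.
power = np.abs(np.fft.rfft(E, axis=0)) ** 2   # shape (p//2 + 1, 6)
total = power.sum(axis=1)                      # power per frequency
dominant = sorted(np.argsort(total)[::-1][:len(freqs)].tolist())
print(dominant)                                # → [3, 17, 41]
```

Running the same analysis on early-training checkpoints would show no such peaks, and on later ones the peaks would appear around the time generalization does.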