
LessWrong (Curated & Popular)
"Deep learning as program synthesis" by Zach Furman

And stored associations don't help you with new inputs.

Here's what's unexpected.

If you keep training, even though the training loss is already nearly as low as it can go, the network eventually starts getting the held-out pairs right too.
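As a concrete sketch of this kind of setup, assuming the task is modular addition (the standard grokking benchmark): enumerate every input pair, hold out a fraction, and train only on the rest. The modulus, split fraction, and variable names below are illustrative assumptions, not the essay's actual configuration.

```python
import random

# Illustrative grokking-style dataset: addition mod a small prime.
p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]  # every (a, b) pair

random.seed(0)
random.shuffle(pairs)

split = int(0.5 * len(pairs))  # train on half the pairs, hold out the rest
train_pairs, heldout_pairs = pairs[:split], pairs[split:]

# Labels are fully deterministic: (a + b) mod p.
train_data = [((a, b), (a + b) % p) for a, b in train_pairs]
heldout_data = [((a, b), (a + b) % p) for a, b in heldout_pairs]

print(len(train_data), len(heldout_data))  # prints: 4704 4705
```

A network can drive training loss to near zero on `train_data` by pure memorization; the interesting question is what happens on `heldout_data` long after that point.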

Not gradually, either.

Test performance jumps from chance to near-perfect over only a few thousand training steps.

So something has changed inside the network.

But what?

It was already fitting the training data.

The data didn't change.

There's no external signal that could have triggered the shift.

One way to investigate is to look at the weights themselves.

We can do this at multiple checkpoints over training and ask: does something change in the weights around the time generalization begins?
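One hedged way to make "structure in the weights" measurable (an illustrative metric, not necessarily the analysis the essay has in mind) is to ask how concentrated the Fourier power of each embedding column is. An unstructured matrix spreads power across all frequencies; a matrix whose columns are pure sinusoids concentrates it in one bin. The stand-in "early" and "late" checkpoints below are synthetic.

```python
import numpy as np

def fourier_concentration(E: np.ndarray) -> float:
    """Fraction of total Fourier power in the single strongest frequency,
    averaged over embedding dimensions. Small for unstructured weights,
    near 1.0 when each column is a pure sinusoid."""
    F = np.abs(np.fft.rfft(E, axis=0))[1:]  # drop the DC component
    power = F ** 2
    return float(np.mean(power.max(axis=0) / power.sum(axis=0)))

p = 97
rng = np.random.default_rng(0)

# Stand-in "early" checkpoint: unstructured Gaussian weights.
E_early = rng.normal(size=(p, 8))

# Stand-in "late" checkpoint: each number n embedded via sinusoids
# at a few frequencies (illustrative choices).
n = np.arange(p)
freqs = [3, 11, 25, 40]
E_late = np.stack(
    [np.cos(2 * np.pi * k * n / p) for k in freqs]
    + [np.sin(2 * np.pi * k * n / p) for k in freqs], axis=1)

print(fourier_concentration(E_early) < fourier_concentration(E_late))  # True
```

Running a metric like this over saved checkpoints would show whether it rises around the step where held-out accuracy jumps.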

It does.

The weights early in training, during the memorization phase, don't have much structure when you analyze them.

Later, they do.

Specifically, if we look at the embedding matrix, we find that it's mapping numbers to particular locations on a circle.

The number zero maps to one position, one maps to a position slightly rotated from that, and so on, wrapping around.

More precisely, the embedding of each number contains sine and cosine values at a small set of specific frequencies.
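A small sketch of why this circular structure is useful, assuming the task is addition mod p: points on a circle compose by rotation, so adding b to a becomes rotating a's point by b's angle, and the wrap-around past p falls out of the angles for free. The modulus `p`, frequency `k`, and helper names here are illustrative.

```python
import numpy as np

p = 97  # modulus (illustrative)
k = 5   # one sinusoidal frequency (illustrative)

def embed(n):
    """Map a number to a point on the unit circle at frequency k."""
    theta = 2 * np.pi * k * n / p
    return np.array([np.cos(theta), np.sin(theta)])

def rotate(v, n):
    """Rotate a 2-D point by the angle assigned to number n."""
    theta = 2 * np.pi * k * n / p
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

# Rotating embed(a) by b's angle lands exactly on embed((a + b) % p),
# wrap-around included.
a, b = 60, 85
print(np.allclose(rotate(embed(a), b), embed((a + b) % p)))  # True
```

This is one circle; using several frequencies at once, as the embedding apparently does, gives the same identity on each circle independently.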

This structure is absent early in training.

It emerges as training continues, and it emerges around the same time that generalization begins.