Zach Furman
And stored associations don't help you with new inputs.
Here's what's unexpected.
If you keep training, despite the training loss already being nearly as low as it can go, the network eventually starts getting the held-out pairs right too.
Not gradually, either.
Test performance jumps from chance to near-perfect over only a few thousand training steps.
So something has changed inside the network.
But what?
It was already fitting the training data.
The data didn't change.
There's no external signal that could have triggered the shift.
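The setup being described can be sketched in code. This is my own illustrative reconstruction, not the exact experiment: a tiny one-hidden-layer network trained with full-batch gradient descent and weight decay on modular addition, with a fraction of the (a, b) pairs held out. The modulus, architecture, and hyperparameters here are assumptions chosen for speed; real grokking runs typically train for far more steps than shown.

```python
import numpy as np

# Sketch of the experiment: modular addition with held-out pairs.
# All hyperparameters here are illustrative assumptions.
rng = np.random.default_rng(0)
p = 23                                    # small modulus for speed
pairs = [(a, b) for a in range(p) for b in range(p)]
rng.shuffle(pairs)
split = int(0.6 * len(pairs))             # 60% train, 40% held out
train, test = pairs[:split], pairs[split:]

def one_hot(a, b):
    x = np.zeros(2 * p)
    x[a] = 1.0
    x[p + b] = 1.0
    return x

X_tr = np.array([one_hot(a, b) for a, b in train])
y_tr = np.array([(a + b) % p for a, b in train])
X_te = np.array([one_hot(a, b) for a, b in test])
y_te = np.array([(a + b) % p for a, b in test])

# One hidden layer, full-batch gradient descent with weight decay.
h = 64
W1 = rng.normal(0, 0.1, (2 * p, h))
W2 = rng.normal(0, 0.1, (h, p))
lr, wd = 0.5, 1e-3

def forward(X):
    H = np.maximum(X @ W1, 0)             # ReLU hidden layer
    Z = H @ W2
    Z = Z - Z.max(axis=1, keepdims=True)  # stable softmax
    P = np.exp(Z); P /= P.sum(axis=1, keepdims=True)
    return H, P

for step in range(2000):                  # far fewer steps than a real run
    H, P = forward(X_tr)
    G = P.copy(); G[np.arange(len(y_tr)), y_tr] -= 1
    G /= len(y_tr)
    dW2 = H.T @ G + wd * W2
    dH = G @ W2.T; dH[H <= 0] = 0
    dW1 = X_tr.T @ dH + wd * W1
    W1 -= lr * dW1; W2 -= lr * dW2

acc_tr = (forward(X_tr)[1].argmax(1) == y_tr).mean()
acc_te = (forward(X_te)[1].argmax(1) == y_te).mean()
print(f"train acc {acc_tr:.2f}, test acc {acc_te:.2f}")
```

Monitoring `acc_te` over a much longer run is what would reveal the delayed jump from chance to near-perfect.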
One way to investigate is to look at the weights themselves.
We can do this at multiple checkpoints over training and ask whether something changes in the weights around the time generalization begins.
It does.
The weights early in training, during the memorization phase, don't have much structure when you analyze them.
Later, they do.
Specifically, if we look at the embedding matrix, we find that it's mapping numbers to particular locations on a circle.
The number zero maps to one position, one maps to a position slightly rotated from that, and so on, wrapping around.
More precisely, the embedding of each number contains sine and cosine values at a small set of specific frequencies.
This structure is absent early in training.
It emerges as training continues, and it emerges around the same time that generalization begins.
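One way to see this kind of circular structure is to Fourier-transform each embedding dimension along the token axis: if the embedding of each number is built from sinusoids at a few frequencies, the spectral power concentrates at exactly those frequencies. The sketch below uses a synthetic embedding matrix with assumed frequencies standing in for a trained checkpoint; in a real analysis you would load the learned embedding instead.

```python
import numpy as np

# Hypothetical stand-in for a trained embedding matrix: each column
# is a sinusoid at one of a few assumed "key frequencies", mimicking
# the circular structure described above.
p = 113                       # modulus (number of tokens); assumed
freqs = [3, 17, 41]           # assumed key frequencies
n = np.arange(p)
cols = []
for f in freqs:
    cols.append(np.cos(2 * np.pi * f * n / p))
    cols.append(np.sin(2 * np.pi * f * n / p))
E = np.stack(cols, axis=1)    # shape (p, 6): embedding of each number

# FFT each embedding dimension along the token axis and sum power
# across dimensions. A circular embedding shows a few sharp peaks.
power = np.abs(np.fft.rfft(E, axis=0)) ** 2   # shape (p//2 + 1, 6)
total = power.sum(axis=1)                      # power per frequency
dominant = sorted(np.argsort(total)[::-1][:len(freqs)].tolist())
print(dominant)                                # → [3, 17, 41]
```

Running the same analysis on early-training checkpoints would show no such peaks, and on later ones the peaks would appear around the time generalization does.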