
Zach Furman


Podcast Appearances

LessWrong (Curated & Popular)
"Deep learning as program synthesis" by Zach Furman

So what is this structure doing?

Following it through the network reveals something unexpected.

The network has learned an algorithm for modular addition based on trigonometry.

[An image appears here in the original post.]

The algorithm exploits how angles add.

If you represent a number as a position on a circle, then adding two numbers corresponds to adding their angles.
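This circle picture can be checked numerically; a minimal sketch in Python (the modulus p = 113 is an illustrative choice, not taken from the post):

```python
import math

p = 113  # modulus (illustrative choice)

def angle(n: int) -> float:
    """Map n to a position on the unit circle: the angle 2*pi*n/p."""
    return 2 * math.pi * n / p

a, b = 40, 100
# Adding the two angles gives the angle of (a + b) mod p,
# up to a full turn of 2*pi.
wrapped = (angle(a) + angle(b)) % (2 * math.pi)
assert abs(wrapped - angle((a + b) % p)) < 1e-9
```

Wrapping around the circle is exactly the "mod p" in modular addition.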

The network's embedding layer performs exactly this representation.

Its middle layers then combine the sine and cosine values of the two inputs using trigonometric identities.
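The identities involved are presumably the angle-addition formulas, which recover the sine and cosine of a sum purely from the sines and cosines of the two inputs; a quick numerical check (the angles below are arbitrary examples):

```python
import math

theta_a, theta_b = 0.7, 1.9  # arbitrary example angles for the two inputs

# Angle-addition identities: the cos and sin of the sum,
# computed from the cos and sin of each input separately.
cos_sum = math.cos(theta_a) * math.cos(theta_b) - math.sin(theta_a) * math.sin(theta_b)
sin_sum = math.sin(theta_a) * math.cos(theta_b) + math.cos(theta_a) * math.sin(theta_b)

assert abs(cos_sum - math.cos(theta_a + theta_b)) < 1e-12
assert abs(sin_sum - math.sin(theta_a + theta_b)) < 1e-12
```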

These operations are implemented in the weights of the attention and MLP layers.

One can read off coefficients that correspond to the terms in these identities.

Finally, the network needs to convert back to a discrete answer.

It does this by checking, for each possible output [formula omitted from the narration], how well [formula omitted from the narration] matches the sum it computed.

Specifically, the logit for output [formula omitted from the narration] depends on [formula omitted from the narration]. This quantity is maximized when [formula omitted from the narration] equals [formula omitted from the narration], the correct answer.

At that point, the cosines at the different frequencies all equal 1 and add constructively.
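This constructive-interference readout can be sketched as follows, assuming logits of the form cos(w_k(a + b - c)) summed over a few frequencies w_k = 2*pi*k/p; the frequency set and modulus here are illustrative assumptions, not values from the post:

```python
import math

p = 113             # modulus (illustrative)
freqs = [1, 5, 17]  # hypothetical frequency indices k

def logit(a: int, b: int, c: int) -> float:
    # Each cosine equals 1 exactly when c == (a + b) mod p,
    # so the terms add constructively at the correct answer
    # and partially cancel everywhere else.
    return sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)

a, b = 40, 100
best = max(range(p), key=lambda c: logit(a, b, c))
assert best == (a + b) % p  # the highest logit picks out the correct sum
```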