Deep Learning as Program Synthesis, by Zach Furman. Published on January 20, 2026.

Epistemic status:
This post is a synthesis of ideas that are, in my experience, widespread among researchers at frontier labs and in mechanistic interpretability, but rarely written down comprehensively in one place; different communities tend to know different pieces of evidence.
The core hypothesis, that deep learning is performing something like tractable program synthesis, is not original to me. Even for me, the ideas are roughly three years old, and I suspect the hypothesis has been arrived at independently many times; see the appendix on related work.
This is also far from finished research: more a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers, and how tools like singular learning theory might address them, towards the end of the post. Thanks to Daniel Murfet, Jesse Hoogland, Max Hennig, and Rumi Salazar for feedback on this post.

Sam Altman: "Why does unsupervised learning work?"
Ilya Sutskever: "Compression. So, the ideal intelligence is called Solomonoff induction."

The central hypothesis of this post is that deep learning succeeds because it is performing a tractable form of program synthesis: searching for simple, compositional algorithms that explain the data.
If correct, this would reframe deep learning's success as an instance of something we understand in principle, while pointing toward what we would need to formalize to make the connection rigorous. I first review the theoretical ideal of Solomonoff induction and the empirical surprise of deep learning's success.
Next, mechanistic interpretability provides direct evidence that networks learn algorithm-like structures; I examine the cases of grokking and vision circuits in detail. Broader patterns provide indirect support: how networks evade the curse of dimensionality, generalize despite overparameterization, and converge on similar representations.
Finally, I discuss what formalization would require, why it's hard, and the path forward it suggests.

Background

"Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes." (Shane Legg)

Early in childhood, human babies learn object permanence: that objects persist even when not directly observed. In doing so, their world becomes a little less confusing.
It is no longer surprising that their mother appears and disappears when she puts her hands in front of her face. They move from raw sensory perception towards interpreting their observations as coming from an external world: a coherent, self-consistent process which determines what they see, feel, and hear. As we grow older, we refine this model of the world. We learn that fire hurts when touched.
Later, that one can create fire with wooden matches.
Eventually, that fire is a chemical reaction involving fuel and oxygen. At each stage, the world becomes less magical and more predictable. We are no longer surprised when a stove burns us or when water extinguishes a flame because we have learned the underlying process that governs their behavior.
This process of learning only works because the world we inhabit, for all its apparent complexity, is not random. It is governed by consistent, discoverable rules. If dropping a glass causes it to shatter on Tuesday, it will do the same on Wednesday. If one pushes a ball off the top of a hill, it will roll down, at a rate that any high school physics student could predict.
Through our observations, we implicitly reverse-engineer these rules.
This idea, that the physical world is fundamentally predictable and rule-based, has a formal name in computer science: the physical Church-Turing thesis. Precisely, it states that any physical process can be simulated to arbitrary accuracy by a Turing machine. Anything from a star collapsing to a neuron firing can, in principle, be described by an algorithm and simulated on a computer.
From this perspective, one can formalize the notion of building a world model by reverse-engineering rules from what we can see.
We can operationalize this as a form of program synthesis: from observations, attempt to reconstruct some approximation of the true program that generated those observations. Assuming the physical Church-Turing thesis, such a learning algorithm would be universal: able, eventually, to represent and predict any real-world process. But this immediately raises a new problem.
For any set of observations, there are infinitely many programs that could have produced them. How do we choose? The answer is one of the oldest principles in science, Occam's razor: we should prefer the simplest explanation. In the 1960s, Ray Solomonoff formalized this idea into a theory of universal induction, which we now call Solomonoff induction.
He defined the simplicity of a hypothesis as the length of the shortest program that can describe it, a quantity known as Kolmogorov complexity. An ideal Bayesian learner, according to Solomonoff, should prefer hypotheses (programs) that are short over ones that are long.
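In standard notation (these are the textbook definitions, not taken from this post): for a fixed universal prefix Turing machine U, the Kolmogorov complexity of a string x and the Solomonoff prior over strings are

```latex
% Kolmogorov complexity: length of the shortest program that prints x
K(x) = \min \{\, |p| \;:\; U(p) = x \,\}

% Solomonoff prior: each program p contributes weight 2^{-|p|},
% so shorter programs (simpler hypotheses) dominate
M(x) = \sum_{p \,:\, U(p) \text{ outputs a string beginning with } x} 2^{-|p|}
```

Prediction then follows by Bayes' rule: the probability that the data x continues with y is M(xy)/M(x), which automatically weights simple explanations more heavily.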
This learner can, in theory, learn anything that is computable, because it searches the space of all possible programs, using simplicity as its guide to navigate the infinite search space and generalize correctly. The invention of Solomonoff induction began a rich and productive subfield of computer science, algorithmic information theory, which persists to this day.
Solomonoff induction is still widely viewed as the ideal or optimal self-supervised learning algorithm, which one can prove formally under some assumptions.
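To make the search concrete, here is a minimal toy sketch (my illustration, not the post's method; the DSL and its description lengths are invented): enumerate a small hypothesis space in order of description length and keep the shortest program consistent with the observations.

```python
# Toy sketch of induction with a simplicity prior -- not the post's method.
# "Programs" come from a tiny, invented DSL, each assigned a made-up
# description length in bits; we pick the shortest program that reproduces
# the observations: Occam's razor made literal.

PRIMITIVES = [
    ("const0", 1, lambda n: 0),       # f(n) = 0
    ("identity", 1, lambda n: n),     # f(n) = n
    ("double", 2, lambda n: 2 * n),   # f(n) = 2n
    ("square", 2, lambda n: n * n),   # f(n) = n^2
    ("pow2", 3, lambda n: 2 ** n),    # f(n) = 2^n
]

def shortest_consistent(observations):
    """Return the shortest-description program reproducing the observed prefix."""
    for name, bits, fn in sorted(PRIMITIVES, key=lambda p: p[1]):
        if all(fn(n) == y for n, y in enumerate(observations)):
            return name, bits, fn
    return None  # no program in the DSL explains the data

obs = [0, 1, 4, 9]                    # prefix generated by f(n) = n^2
name, bits, fn = shortest_consistent(obs)
print(name, fn(len(obs)))             # -> square 16
```

Real Solomonoff induction does this over all programs for a universal machine, weighting each by 2^(-length), which is exactly what makes it incomputable; the toy version only works because the hypothesis space is finite and tiny.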
These ideas, and extensions of them like AIXI, were influential for early deep learning thinkers like Jürgen Schmidhuber and Shane Legg, and shaped a line of work attempting to theoretically predict how smarter-than-human machine intelligence might behave, especially within AI safety. Unfortunately, despite its mathematical beauty, Solomonoff induction is completely intractable.
Vanilla Solomonoff induction is incomputable, and even computable approximations like the speed prior are exponentially slow. Theoretical interest in it as a platonic ideal of learning remains to this day, but practical artificial intelligence has long since moved on, assuming it to be hopelessly infeasible. Meanwhile, neural networks were producing results that nobody had anticipated.
This was not the usual pace of scientific progress, where incremental advances accumulate and experts see breakthroughs coming. In 2016, most Go researchers thought human-level play was decades away. AlphaGo arrived that year. Protein folding had resisted 50 years of careful work. AlphaFold essentially solved it over a single competition cycle.
Large language models began writing code, solving competition math problems, and engaging in apparent reasoning, capabilities that emerged from next token prediction without ever being explicitly specified in the loss function. At each stage, domain experts, not just outsiders, were caught off guard. If we understood what was happening, we would have predicted it. We did not.