
Astral Codex Ten Podcast

Next-Token Predictor Is An AI's Job, Not Its Species

02 Apr 2026

Transcription

Chapter 1: What is the main topic discussed in this episode?

0.031 - 20.038 Jeremiah

Welcome to the Astral Codex Ten podcast for the 26th of February, 2026. Title: Next-Token Predictor Is An AI's Job, Not Its Species. This is an audio version of Astral Codex Ten, Scott Alexander's Substack. If you like it, you can subscribe at astralcodexten.substack.com. 1.


Chapter 2: What is the main argument about AI's capabilities?

21.039 - 45.424 Jeremiah

In The Argument, link in post, Kelsey Piper gives a good description of the ways that AIs are more than just next token predictors or stochastic parrots. For example, they also use fine-tuning and RLHF. But commenters, while appreciating the subtleties she introduces, object that they're still just extra layers on top of a machine that basically runs on next-token prediction. Here's a comment.


46.105 - 61.884 Jeremiah

Quote, No, it's just next token prediction on a biased set. If you fine-tune on a set of recipes, it will be more likely to predict recipes. If you fine-tune on answers that have been selected by humans to be typical of a helpful assistant, it will be more likely to predict text characteristic of a helpful assistant.
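To make the commenter's picture concrete, here is a minimal sketch, assuming the Hugging Face transformers library with GPT-2 as a small stand-in model and a couple of hypothetical assistant-style lines as the "biased set": the loss is the same next-token cross-entropy used in pretraining; only the data changes.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A toy "biased set": hypothetical helpful-assistant style answers.
finetune_texts = [
    "User: How do I boil an egg? Assistant: Place the egg in boiling water for 7-9 minutes.",
    "User: What is 3 plus 3? Assistant: 3 plus 3 is 6.",
]

model.train()
for text in finetune_texts:
    batch = tokenizer(text, return_tensors="pt")
    # labels=input_ids gives the standard causal-LM (next-token) cross-entropy loss
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```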


62.705 - 90.46 Jeremiah

Next token prediction is a structural property, the structural property, of what these models are. It can't be changed by fine-tuning. Scott writes, I want to approach this from a different direction. I think overemphasizing next token prediction is a confusion of levels. On the levels where AI is a next-token predictor, you are also a next-token (technically, next-sense-datum) predictor.


91.142 - 112.583 Jeremiah

On the levels where you're not a next token predictor, AI isn't one either. Putting all the levels in graphic form, this is a chart that shows two sequences of steps, one for human and one for LLM, with the comparable levels next to each other. So we have evolution, sex and reproduction, for a human, and for an LLM it's incentives, AI company profit motive.


113.825 - 117.631 Jeremiah

For the human we have predictive coding, next sense datum prediction.

Chapter 3: How do fine-tuning and RLHF contribute to AI performance?

118.653 - 141.711 Jeremiah

LLM has training, next token prediction. Then for a human, next we have, I don't know, I just thought about it really hard. And then an LLM has question mark, maybe nothing? Then the human has, for example, high D toroidal attractor manifolds. And for an LLM, for example, rotation of 6D helical manifolds. And then for the human, finally we have neurons and neurotransmitters.


142.271 - 163.996 Jeremiah

And for the LLM we have chips and electricity. 2. The human brain was designed by a series of nested optimization loops. The outermost loop is evolution, which optimized the human genome for being good at survival, sex, reproduction, and child-rearing. But evolution can't encode everything important in the genome.


164.717 - 186.52 Jeremiah

It obviously can't include individual and cultural features like the vocabulary of your native language, or your particular mother's face. But even a lot of things that could be there in theory, like how to walk, or which animals are most nutritious, are missing. The genome is too small for it to be worth it. Instead, evolution gives us algorithms that let us learn from experience.


187.023 - 229.095 Jeremiah

These algorithms are a second optimization loop, "evolving" neuron patterns into forms that better promote fitness, reproduction, etc. The most powerful such algorithm is called predictive coding, which neuroscience increasingly considers a key organizing principle of the brain. Wikipedia describes it as, quote, end quote.


230.24 - 248.68 Jeremiah

Scott writes, in other words, the brain organizes itself and learns things by constantly trying to predict the next sense datum, then updating synaptic weights towards whatever form would have predicted the next sense datum most efficiently. This is a very close, though not exact, analog to the next-token prediction of AI.
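As a toy illustration of that loop, and purely my own construction rather than anything from the post, here is a minimal sketch of a linear "brain" whose weights are repeatedly nudged toward whatever would have predicted the next sense datum better:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3))   # toy "synaptic weights"
learning_rate = 0.01

def sense_datum(t):
    # Hypothetical sensory stream: a slowly rotating 3-D signal plus a constant channel.
    return np.array([np.sin(0.1 * t), np.cos(0.1 * t), 1.0])

for t in range(2000):
    x_now, x_next = sense_datum(t), sense_datum(t + 1)
    prediction = W @ x_now
    error = prediction - x_next                  # prediction error ("surprise")
    W -= learning_rate * np.outer(error, x_now)  # nudge weights toward what would have predicted x_next

print("remaining prediction error:",
      np.linalg.norm(W @ sense_datum(2000) - sense_datum(2001)))
```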

250.122 - 269.789 Jeremiah

This process organizes the brain into a form capable of predicting sense data called a world model. For example, if you encounter a tiger, the best way of predicting the resulting sense data, the appearance of the tiger pouncing, the sound of the tiger's roar, the burst of pain at the tiger's jaws closing around your arm, is to know things about tigers.

271.712 - 288.947 Jeremiah

On the highest and most abstract levels, these are things like tigers are orange, tigers often pounce, and tigers like to bite people. On lower levels, they involve the ability to translate high-level facts like tigers often pounce into a probabilistic prediction of the tiger's exact trajectory.

289.848 - 313.551 Jeremiah

All of this is done via neural circuits we don't entirely understand and implemented through the usual neuroscience stuff like synapses and neurotransmitters. To you, it just feels like, I don't know, I thought about it and I realized the tiger would pounce over there. Here's a picture of a tiger, ready to pounce. 3. The AI's equivalent of evolution is the AI companies designing them.

314.432 - 331.78 Jeremiah

Just like evolution, the AI companies realized that it was inefficient to hand-code everything the AIs needed to know (a giant look-up table) and instead gave the AIs learning algorithms, deep learning. As with humans, the most powerful of these learning algorithms was next-token prediction –

Chapter 4: What confusion arises from overemphasizing next-token prediction?

422.401 - 439.228 Jeremiah

Cf. Organisms Are Adaptation Executors, Not Fitness Maximizers, link in post, which does a good job hammering in the point that we run algorithms designed by the evolutionary imperative to maximize survival and reproduction, rather than considering survival and reproduction explicitly in our decisions.


440.31 - 459.487 Jeremiah

When a monk decides to swear an oath of celibacy and never reproduce, he does so using a brain that was optimized to promote reproduction, just using it very far out of distribution, in an area where it no longer functions as intended. One level lower down, your brain was shaped by next-sense-datum prediction.


460.248 - 478.471 Jeremiah

Partly you learned how to do addition because only the mechanism of addition correctly predicted the next word out of your teacher's mouth when she said "3 plus 3 is." It's more complicated than this, sorry, but this oversimplification is basically true. You don't feel like you're predicting anything when you're doing a math problem.
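For a concrete feel of that claim, here is a minimal sketch, assuming the Hugging Face transformers library and GPT-2 as a small stand-in model (not any system discussed in the post), that asks for the single most likely next token after "3 plus 3 is":

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("3 plus 3 is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, sequence_length, vocab_size)

next_id = int(logits[0, -1].argmax())      # highest-scoring next token at the final position
print(repr(tokenizer.decode([next_id])))   # e.g. " 6", if the model has learned its sums
```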


478.831 - 498.657 Jeremiah

You're just doing good, normal, mathematical steps, like reciting PEMDAS to yourself and carrying the one. In the same way, even though an AI was shaped by next token prediction, the inside of its thoughts doesn't look like next token prediction. In the abstract, it probably looks like a world model, the same as yours. In the concrete...


500.257 - 518.977 Jeremiah

The science of figuring out what an AI's innards are concretely doing is called mechanistic interpretability. It's very hard to do. AI innards are notoriously confusing, and one team at Anthropic produces most of the headline results. Recently, they explored how Claude predicts where a line break will be in a page of text.

519.758 - 542.442 Jeremiah

Since a line break is a token, this is literally a next-token prediction task. Here's a diagram. It's captioned, Key steps in the line-breaking behavior can be described in terms of the construction and manipulation of manifolds. So there's a series of sub-diagrams in here. The first is captioned, LLMs perceive visual properties of text despite only seeing a list of numbers.

544.384 - 572.831 Jeremiah

So it shows line-wrapped text with various words and a line break, and then it says what the model sees. It's just a long list of numbers. How are different character counts represented? Character count and line width manifolds are aligned and discretized by features. How is the boundary detected? Boundary heads use QK to shift offsets of line count and width manifolds. Will the next word fit?

574.274 - 588.557 Jeremiah

Final prediction is made by arranging representations to be linearly separable. And how are representations constructed? Multiple attention heads specialized in particular token ranges. Their sum creates a line count manifold.

590.199 - 612.951 Jeremiah

Scott writes, The answer was, the AI represents various features of the line breaking process as one-dimensional helical manifolds in a six-dimensional space, then rotates the manifolds in some way that corresponds to multiplying or comparing the numbers that they're representing. You don't need to understand what this means, so I've relegated my half-hearted attempt to explain it to a footnote.
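For intuition only, here is a low-dimensional toy of what "counts on a helix, manipulated by rotation" can mean; this is my own cartoon, not Anthropic's actual circuit, and the period and line width are made up:

```python
import numpy as np

PERIOD = 64  # hypothetical period for the circular component

def embed(n):
    # Place the integer n on a helix: angle encodes n (mod PERIOD), height encodes magnitude.
    theta = 2 * np.pi * n / PERIOD
    return np.array([np.cos(theta), np.sin(theta), n / PERIOD])

def add_by_rotation(v, k):
    # "Adding k" = rotate the circular part by k steps and shift along the helix axis.
    phi = 2 * np.pi * k / PERIOD
    rotation = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                         [np.sin(phi),  np.cos(phi), 0.0],
                         [0.0,          0.0,         1.0]])
    return rotation @ v + np.array([0.0, 0.0, k / PERIOD])

# Toy "will the next word fit?" check: characters so far plus the next word's length.
chars_so_far, word_length, line_width = 37, 5, 40
predicted = add_by_rotation(embed(chars_so_far), word_length)
assert np.allclose(predicted, embed(chars_so_far + word_length))
print("over the line width?", chars_so_far + word_length > line_width)
```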

Chapter 5: How does the human brain compare to AI in terms of optimization?

934.22 - 971.173 Jeremiah

I'm hoping to combine all my writing on this into an anti-stochastic parrot FAQ, so don't worry if I don't immediately rebut all of them in this post. Thank you for listening, and I'll speak to you next time.
