Neuralese hasn't won yet.
Chain of thought has a structural advantage, and it's worth understanding why.
When a model reasons in text, what gets stored?
Tokens.
Tokens are small.
You can store long sequences of them, and searching back through them to find what's relevant stays feasible even as the sequence grows.
Need to remember something from step 12?
It's right there.
The model looks back over its own output and finds it instantly.
If the problem is complex and requires long chains of reasoning, just keep generating tokens until you're done.
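For a sense of scale, here's a rough back-of-envelope sketch. The trace length and the 2-byte token ID are illustrative assumptions, not measurements from any particular model:

```python
# Back-of-envelope: a long chain-of-thought trace stored as discrete tokens.
steps = 100_000                 # reasoning tokens in a very long trace (illustrative)
bytes_per_token_id = 2          # one int16 id per token, assuming a vocab under 65k

scratchpad_kib = steps * bytes_per_token_id / 1024
print(f"token scratchpad: ~{scratchpad_kib:.0f} KiB")   # ~195 KiB

# And "looking back" is just re-reading your own output:
trace = [f"step {i}: ..." for i in range(steps)]
print(trace[12])                # step 12 is right there, verbatim
```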
Neuralese doesn't have this luxury.
To match chain of thought's capability, Neuralese needs a scratchpad too.
But internal states are big. That's what makes them rich, and it's also what makes them hard to scale.
At runtime, storing and searching those states gets expensive, and continuous values accumulate errors in ways discrete tokens don't.
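Two toy sketches of those costs. The hidden-state width, precision, and noise level below are assumptions chosen for illustration, not properties of any real neuralese system:

```python
import numpy as np

# Cost 1: size. The same 100,000-step trace, stored as hidden states instead of ids.
steps, d = 100_000, 4096                    # trace length and hidden width (assumed)
state_mib = steps * d * 2 / 2**20           # float16: 2 bytes per value
print(f"hidden-state scratchpad: ~{state_mib:.0f} MiB")   # ~781 MiB vs ~195 KiB of tokens

# Cost 2: drift. Discrete tokens snap back to a fixed vocabulary every step,
# so tiny numerical errors get absorbed; continuous values carry them forward.
rng = np.random.default_rng(0)
vocab = np.linspace(-1, 1, 51)              # toy "vocabulary" of allowed values
x_cont = x_disc = 0.0
for _ in range(1_000):
    x_cont += rng.normal(0, 1e-3)                       # error accumulates
    noisy = x_disc + rng.normal(0, 1e-3)
    x_disc = vocab[np.abs(vocab - noisy).argmin()]      # snapping absorbs it
print(f"continuous drift: {abs(x_cont):.3f}, discrete drift: {abs(x_disc):.3f}")
```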
In training, parallelizing across machines is trickier when each state depends on the last.
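A minimal sketch of that dependency, with a stand-in matmul for "the model" and made-up sizes. With a token trace, the whole target sequence exists up front, so every step can be computed at once; with chained hidden states, step t has no input until step t-1 finishes, which is what makes splitting the work awkward:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                                  # reasoning steps and hidden width (made-up)
W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for "the model"

# Token-style training: the ground-truth trace already exists, so every step's
# input is known up front and all T steps go through in one parallel pass.
token_inputs = rng.standard_normal((T, d))    # embeddings of the recorded trace
parallel_out = token_inputs @ W               # one matmul covers every step

# Neuralese-style training: step t's input *is* step t-1's output, so nothing
# can start until the previous state exists; the chain is inherently serial.
state = rng.standard_normal(d)
serial_out = []
for _ in range(T):
    state = np.tanh(state @ W)                # must wait for the previous state
    serial_out.append(state)
```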
This is the fundamental trade-off.
Richer representations versus scalability.
Neuralese researchers are exploring different points along this trade-off, trying to find one that outperforms chain of thought.
So far, none have.
Until someone does, the answer to "where should the next unit of compute go?" keeps coming back to chain of thought.