Rob Wiblin
But as 2025 wore on, it became apparent that the same thing wasn't really happening here with reasoning generalization.
The reasoning models that had been optimized to reason were a lot better at math and logic and coding, but they weren't suddenly able to extrapolate from that to go and book you a flight, say, or go away and organize an event that actually works.
A senior staff member at an AI company recently told me that for them, this overall experience actually updated them towards longer timelines to artificial general intelligence.
Because until then, that possible generalization from easily checkable to non-checkable domains had been one plausible path to really rapid, perhaps unexpectedly rapid, capabilities gains.
And now they saw that that path was basically ruled out.
A lot of people think that this autonomy issue is changing right now with the arrival of Anthropic's Claude Opus 4.5 and Claude Code and, as of January 2026, Claude Cowork.
But even if that does pan out as much as people are hoping and some people are expecting, it'll be because Anthropic trained its models on these autonomy tasks specifically, not because of magical generalization from reasoning tasks that we had originally hoped would get us those kinds of capabilities sort of for free.
So there are two big ways that reasoning models got so much better at solving a particular kind of problem.
One was actually being better at reasoning, you know, basically being smarter and more logical and able to maintain a thread for a decent amount of time.
And the other was that they were given much longer to think through each question they were asked, where previous models had mostly not been given thinking time at all.
They'd more or less just had to blurt out the first thing that popped into their head when they were asked a question.
Early on with reasoning models, it was kind of hard to tell how much work was being done by the first thing, actually being smarter, versus the second thing, so-called inference scaling.
And that question mattered enormously because it's really expensive to have LLMs think for a lot longer every single time you want them to answer something.
The thing is, we could get this big proportional increase in thinking time in 2024 and 2025 because we'd basically been starting from zero.
But while it was affordable to go from giving models zero minutes of thinking time up to one minute of thinking time, there actually aren't enough computer chips in the world to go on and give models 10 minutes or 100 minutes to think in order to give slightly better answers, at least not for normal situations.
And that fact would make the gains from increasing thinking time more of a one-off, not a trend that could carry on from 2024 to 2025 into 2026 and 2027.
And indeed, it does appear that more than two thirds of the improved performance of reasoning models came from giving them more time to think.
And unfortunately, that's not a trick we can pull off again, not until we go away and actually manufacture more and faster computer chips, which does happen, but doesn't happen at anywhere near the rate that we were able to scale up thinking time before.
And this realization meant that analysts came to expect slower improvements in AI capabilities going forward.
Past guest of the show, Toby Ord, has done the best public analysis of this that I'm aware of.