
Azeem Azhar's Exponential View

2025 AI reality check: Are we in a bubble?

09 Jul 2025

Transcription

Full Episode

0.031 - 25.549 Azeem Azhar

So let's get started, though, with marking my work for this year. So the first prediction was that there would be no AI wall. There was this battle over whether scaling AI models was still a strategy that would work, that would deliver results, and whether models could actually even be built at that size or not. My second prediction was about the speed of deployments,


25.529 - 44.245 Azeem Azhar

making certain predictions about how fast things would spread and what would happen to the price of tokens. I made a prediction that bots would out-talk humans this year in the production of natural language. I also predicted that Waymo would overtake Uber in San Francisco.


44.265 - 68.968 Azeem Azhar

I noted that I expected climate extremes to intensify significantly, and that alongside this, and despite the change in the political environment, renewable deployment, particularly of solar, would continue to surprise to the upside. And alongside that, that again, despite changes to the political environment, electric vehicles would significantly shift up a gear in their markets.


68.948 - 94.18 Azeem Azhar

So there were seven predictions, and I had some watch list themes around geopolitical volatility, demographic decline, and climate and capital, which I didn't have strong tests against. But let's start with the first one, which was that there would be no AI wall. And I said, look, research is accelerating, not plateauing. And we would likely see a 10 million token context model


94.16 - 117.484 Azeem Azhar

and reasoning breakthroughs across some of these reasoning benchmarks. Now, both of those matter because the context window of a model is a little bit like its working memory. It's a bit shonky as a working memory, but it's the bit that you put into your LLM and it can manipulate back and forth. And when you get to the end of the context window, it tends to hallucinate much, much more.
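The "working memory" analogy above can be sketched in code: before each call, the prompt must fit inside the model's token budget, so older material gets dropped and the model effectively forgets it. This is a minimal illustrative sketch, not any vendor's API; the crude word-count tokenizer and the 8,000-token budget are assumptions for the example.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())


def fit_to_context(messages: list[str], max_tokens: int = 8_000) -> list[str]:
    """Keep only the most recent messages that fit within max_tokens."""
    kept: list[str] = []
    budget = max_tokens
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break  # this message and everything older falls out of "memory"
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))  # restore chronological order


# Five short turns plus one very long document (7,995 words).
history = ["hello world"] * 5 + ["a " * 7_995]
window = fit_to_context(history, max_tokens=8_000)
print(len(window))  # the long document plus only the turns that still fit
```

A bigger context window simply raises `max_tokens`, so less history is evicted; the transcript's point is that even when everything technically fits, quality tends to degrade as you approach the limit.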

117.464 - 137.323 Azeem Azhar

And the reasoning tests, like FrontierMath and ARC-AGI, are very useful tests for whether a model is capable not just of producing text, but also of reasoning through problems. Now, in both cases, I asked ChatGPT o3 to mark my work. And it said, you are accurate.

137.563 - 161.332 Azeem Azhar

There's been no slowdown in AI progress, that ultra-long context models have emerged, and AI models are shattering reasoning benchmarks. And that's true. I mean, there is this 10 million token context window with Meta's Llama 4 Scout model. But I want to say that even though that model was released, Meta had had all sorts of issues with its Llama models and there were problems at the time.

161.372 - 182.677 Azeem Azhar

So while ChatGPT has very gracefully given me that as the evidence point that I was right, I don't think that's the best evidence point. There are a couple of other models out there, Megadev, and I believe Gemini has one in testing, which are up at that 10 million token level. And at the same time, o3 itself has really, really pushed its benchmark scores.

183.118 - 208.834 Azeem Azhar

So on the FrontierMath test, which GPT-4 used to score 2% on, o3 got 25%. And I think that this is clearly what we are seeing. And if you're using the tools, you have noticed that they've got better. Although, as people also know, they're really, really quite unstable. Something that happened with our team was that we had standardized on Gemini 2.0 Flash for some of our internal workflows.
