The Daily AI Show

The Problem With AI Benchmarks

07 Jan 2026

Transcription

Chapter 1: Why are traditional AI benchmarks becoming less effective?

0.031 - 23.564 Beth

Hey, good morning, everybody. It is January 7, 2026, and you are watching or listening to The Daily AI Show. I am Beth, and with me in the studio today is Andy as well. Good morning, Jude. Good morning, Bea. Good morning, Jeff, in the chat.

Chapter 2: What challenges arise from measuring AI in real-time environments?

24.105 - 28.671 Beth

So great to see you all. And Andy, how are you doing today?

29.174 - 44.903 Andy

Oh, well, I'm nursing a cold, but while I'll sound croaky today, I think there's a lot to discuss. Lots more news coming out of CES and some other exciting developments in the world of AI.

46.065 - 49.23 Beth

Absolutely. So hit us with your first story.

Chapter 3: How do aggregate metrics fail in AI evaluation?

49.291 - 52.296 Beth

What's top of mind for you today?

52.883 - 69.272 Andy

Yeah, there's a couple that are kind of equivalent. First, I just wanted to mention that what we're seeing in the advancement of reasoning in AI models is this idea of recursive inference.

69.556 - 93.91 Andy

So the loop, we discussed this yesterday, this concept of looping: the ability to achieve, with a smaller model or even with a large model, more accurate results and better reasoning logic by doing recursive inference, returning loops with some kind of memory that it acts on. Well, DeepSeek...

94.75 - 105.233 Andy

This Chinese AI startup has just added an interleaved thinking feature to its chatbot, which is this idea.

Chapter 4: What role does real-world context play in AI performance?

106.075 - 130.041 Andy

And it performs multi-step research with reasoning interspersed. I love the English terms here that are capturing, again, this looping idea, interspersed throughout the process of the dialogue, assessing information credibility between actions rather than thinking and then generating a finalized response.

131.183 - 167.569 Andy

And then the other interesting point about DeepSeek is that their monthly active users in December of 2025 is now 132 million. That's a lot of users, and in January they had 34 million. So up to 132 from 34. Reckon that that's, you know, more than 4x, right? Oh, Anne is with us.

Chapter 5: Why are hidden failures and edge cases significant in AI systems?

168.09 - 194.476 Anne

Good morning. I had the ultimate need for the house robot today. Yeah. We've had a little bit of just-in-time distribution issues with our coffee accoutrements during this, you know, I mean, we don't know which way is up, really. We're taking this first business week of the year off, more or less. I DoorDashed coffee to my house this morning.

194.496 - 199.882 Anne

I live within, I don't know what, 0.4 miles from like, you know, four different Starbucks.

Chapter 6: How do perception and interpretation affect AI evaluation?

199.962 - 213.598 Anne

But, you know, it just wasn't coming together. The robot was still sleeping. And now I can start the day. I came here specifically for Andy's news because it has been so good recently.

216.521 - 225.957 Andy

That's so good. Well, I hope not to disappoint. Go ahead, Beth. What's your first one?

Chapter 7: What is the importance of observability in AI trust?

226.19 - 236.559 Beth

Well, no, you were just saying that the DeepSeek users have increased. And where does that, where is DeepSeek? Is that on Hugging Face? Is that like where?

236.599 - 256.095 Andy

No, no. DeepSeek is a model that you can access through a number of sites, whether Poe or you can just go directly to DeepSeek and use that. So around the world, 132 million people are monthly active users. So within a month window.

256.075 - 271.333 Andy

Compare this to approximately 1 billion using ChatGPT, which is the far and away leader, but DeepSeek is right up there in terms of monthly active users with the second-wave crowd of AI models.

271.567 - 313.503 Beth

Right. Part of why I was curious was that, because you can use DeepSeek in so many places, who's counting that? But absolutely, because part of the news that we're seeing is the idea that, like, yes, it's not necessarily the best model, but is second best for almost free just as good, right? We've talked before about, like, are we in a place where you're fine, right?

313.644 - 321.635 Beth

For 95% of what you're going to ask something to do, you're fine using a model that is much less expensive.

Chapter 8: How should organizations rethink AI validation as systems scale?

322.196 - 328.629 Beth

Yeah, very cool. All right. Give us your number two. I'll put up my line.

328.669 - 346.554 Andy

OK, I'll do another one, which is interesting. So, AI models, and specifically large language models, are used to do spreadsheets, and now they're becoming more and more competent in doing those kinds of financial analysis-type things.

347.095 - 363.96 Andy

But when you get to the realm of pure mathematics, where mathematicians are working on proofs and developing new theorems in sort of the ethereal world of mathematics, LLMs are not very impressive.

363.94 - 390.218 Andy

And one of the world's biggest mathematicians, a fellow named Joel David Hamkins, has slammed AI models used for solving mathematics and calls them zero and garbage, adding he doesn't find them useful at all. He highlighted AI's frustrating tendency to confidently assert incorrect conclusions and resist correction. They'll argue with him.

391.582 - 428.036 Andy

And he said, quote, if I were having such an experience with a person, I would simply refuse to talk to that person again. Now, bring on Axiom Math. Founded by a 24-year-old Stanford dropout, Carina Hong, it raised a $64 million seed round. Wow. To build an AI mathematician. And major investors are behind this, including Greycroft and Menlo Ventures, a couple of VC firms whose names I recognize.

429.158 - 452.723 Andy

And its core architectural idea is to move from generic next-token prediction, which creates hallucinations, as we know, in LLMs, and instead use a stack that tightly couples a language modeling algorithm, sort of a kernel, with formal proof systems and programmatic reasoning from mathematics. So...

453.631 - 478.103 Andy

It's not trained on the broad web and conversational data, so it's not going to spin out and you can't kind of jailbreak it and have it talk about politics or anything. It's a math-specific shell of formal mathematical languages, proof checkers, and verification-driven training signals that goes beyond standard LLMs. So...

478.336 - 505.801 Andy

It doesn't have all those problems, and as a result, because each reasoning step is meant to be checked by a proof engine as it's running, it is virtually free of hallucinations that are common in generic LLM outputs. So, David... You know, check it out. Axiom Math. I mean, you should probably be a consultant to Axiom Math. Maybe he is.

506.864 - 515.903 Andy

But his complaints about LLMs, which are valid, are going to be solved by this new startup, Axiom Math.
