Andy

The Problem With AI Benchmarks

AI models, and specifically large language models, are used to work with spreadsheets, and now they're becoming more and more competent at those kinds of financial-analysis tasks.

The Daily AI Show

But when you get to the realm of pure mathematics, where mathematicians are working on proofs and developing new theorems in sort of the ethereal world of mathematics, LLMs are not very impressive.

And one of the world's biggest mathematicians, a fellow named Joel David Hamkins, has slammed AI models used for solving mathematics and calls them zero and garbage, adding he doesn't find them useful at all.

He highlighted AI's frustrating tendency to confidently assert incorrect conclusions and resist correction.

They'll argue with him.

And he said, quote, if I were having such an experience with a person, I would simply refuse to talk to that person again.

Now, bring on Axiom Math.

Founded by a 24-year-old Stanford dropout, Carina Hong, it raised a $64 million seed round.

Wow.

To build an AI mathematician.

And major investors are behind this, including Greycroft and Menlo Ventures, a couple of VC firms whose names I recognize.

And its core architectural idea is to move away from generic next-token prediction, which creates hallucinations, as we know, in LLMs, and instead use a stack that tightly couples a language-modeling algorithm, sort of a kernel, with formal proof systems and programmatic reasoning from mathematics.

It's not trained on the broad web and conversational data, so it's not going to spin out and you can't kind of jailbreak it and have it talk about politics or anything.

It's a math-specific shell of formal mathematical languages, proof checkers, and verification-driven training signals that goes beyond standard LLMs.
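
For readers unfamiliar with formal mathematical languages and proof checkers, here is a minimal Lean 4 sketch of what "machine-checked" means. It is a generic illustration only and is not taken from Axiom Math's system; the theorem names are made up for this example.

```lean
-- The checker accepts this statement because both sides compute to the same value.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- A general statement, justified by a lemma from Lean's core library.
theorem swap_sum (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- A false claim does not get through. Uncommenting the next line makes
-- the file fail to check, no matter how confidently it is written.
-- theorem bad_claim : 2 + 2 = 5 := rfl
```

The point of the sketch is that correctness is decided by the checker, not by how plausible the text looks.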

So...

It doesn't have all those problems, and as a result, because each reasoning step is meant to be checked by a proof engine as it's running, it is virtually free of hallucinations that are common in generic LLM outputs.
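
The idea of checking each reasoning step as it is produced can be sketched in a few lines of Python. This is a toy under loud assumptions: the "proposer" is a hard-coded lookup table standing in for a language model, the "proof engine" only checks that two integer arithmetic expressions have the same value, and every name here (`evaluate`, `verified_chain`, `toy_proposer`, `CANDIDATES`) is hypothetical. It is not Axiom Math's architecture, only the general propose-then-verify pattern.

```python
import ast
import operator
from typing import Callable, Iterable, List

# Supported operators for the toy checker (integers only).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expr: str) -> int:
    """Safely evaluate an integer expression built from +, - and *."""
    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


def verified_chain(
    start: str,
    propose: Callable[[str], Iterable[str]],
    max_steps: int = 10,
) -> List[str]:
    """Extend a chain of rewrites, keeping only steps the checker accepts."""
    chain = [start]
    for _ in range(max_steps):
        current = chain[-1]
        target = evaluate(current)
        # Take the first proposed step that passes verification, if any.
        accepted = next(
            (cand for cand in propose(current) if evaluate(cand) == target),
            None,
        )
        if accepted is None:
            break  # nothing verifiable was proposed, so stop instead of guessing
        chain.append(accepted)
    return chain


# Hypothetical proposer: a lookup table that sometimes suggests wrong steps first.
CANDIDATES = {
    "2 * (3 + 4)": ["2 * 3 + 4", "2 * 3 + 2 * 4"],  # first candidate is wrong
    "2 * 3 + 2 * 4": ["6 + 8"],
    "6 + 8": ["15", "14"],                          # first candidate is wrong
}


def toy_proposer(current: str) -> List[str]:
    return CANDIDATES.get(current, [])


chain = verified_chain("2 * (3 + 4)", toy_proposer)
print(" = ".join(chain))
```

In this sketch the wrong suggestions ("2 * 3 + 4" and "15") are proposed but never enter the chain, because the checker rejects them before the next step is built on top of them.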

So, David...

You know, check it out.

Axiom Math.
