Andy
And, you know, including on the MathArena Apex mathematics test, Gemini 3 Pro scored 23.4%, compared to GPT-5.1's 1%.
That's how big an advantage Google's Gemini 3 Pro has there.
And then in addition to that, Gemini 3 Pro, again, a slightly larger model than the standard Gemini 3, and I'm not sure how to parse those, because there's also Gemini 3 Deep Think, which is the winner on ARC-AGI-2.
I'm going to talk about that in a second.
But Gemini 3 Pro set a new high score for AI models on TrackingAI's offline IQ test, meaning it cannot use search or anything else.
It has to use its internal reasoning capabilities.
And it surpassed Grok 4 Expert mode's 126, achieving a 130-point IQ on what is basically a reasoning test. So that's Google Gemini 3. So then Anthropic's coming back strong with Claude Opus 4.5.
It reclaimed the top spot on key coding benchmarks like SWE-bench Verified, the software engineering benchmark,
scoring 81% and edging out Gemini 3 Pro's 76%.
So significantly better than Gemini 3 Pro.
And at the same time, the important takeaway for Claude Opus 4.5 is that they cut their prices.
Opus was their top model before, and the 4.1 version was expensive.
The new Opus 4.5 is a third the cost of what Opus 4.1 was before.
So they're making it affordable, and it is the top model,
a state-of-the-art model when it comes to coding benchmarks. Okay, but then OpenAI came out with Codex Max, using 5.1, and that one introduced this context compaction technology that allows for breakthroughs in the context continuity of agentic work.
Basically, there's no end to how long this thing can go on reasoning without losing track of what it's doing.
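To give a rough sense of what context compaction means in practice, here is a minimal Python sketch of the general idea, not OpenAI's actual Codex Max mechanism: when the running transcript of an agent's work nears a token budget, older turns get folded into a short summary so the session can keep going. The summarize() helper below is a hypothetical stand-in for a model call that would condense the old turns.

# Minimal sketch of context compaction (illustration only, assumptions noted above).

def count_tokens(text: str) -> int:
    # Crude token estimate: whitespace-delimited words.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Hypothetical summarizer: a real system would ask a model to condense
    # these turns; here we just keep the first sentence of each.
    return " ".join(t.split(".")[0] + "." for t in turns)

class CompactingContext:
    def __init__(self, budget_tokens: int = 200):
        self.budget = budget_tokens
        self.summary = ""             # compacted history
        self.recent: list[str] = []   # verbatim recent turns

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        self._maybe_compact()

    def _maybe_compact(self) -> None:
        # If the full context would exceed the budget, fold the oldest
        # half of the recent turns into the running summary.
        while count_tokens(self.render()) > self.budget and len(self.recent) > 1:
            cutoff = max(1, len(self.recent) // 2)
            old, self.recent = self.recent[:cutoff], self.recent[cutoff:]
            self.summary = summarize([self.summary] + old if self.summary else old)

    def render(self) -> str:
        # What would actually be sent to the model on the next step.
        parts = [f"Summary of earlier work: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.recent)

if __name__ == "__main__":
    ctx = CompactingContext(budget_tokens=60)
    for i in range(20):
        ctx.add_turn(f"Step {i}: the agent edited file_{i}.py. Tests still passing.")
    print(ctx.render())

The design choice the sketch illustrates is simply that the prompt the model sees stays bounded while the work itself does not; how aggressively to summarize, and what detail to preserve, is where the real engineering lives.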