Andy Halliday
So notice here, this chart is also compressed, right?
Zero to 70% accuracy is squeezed into this little sliver at the bottom.
So if you saw these bars at their full extent, you'd see that they're all pretty close.
The worst, Opus 4.1, is at 75 percent.
And, you know, the next one's at 81 percent.
So look at GPT-5.1 Codex Max.
That was phenomenal.
It just came out recently, after Gemini 3.0 Pro. You see it goes 76.2, let's round them, 76; it goes to 78 with GPT-5.1 Codex Max; and then Opus 4.5 jumps to 81. Those are big numbers at the margin, right? But again, if you look at the competence of any one of these coding agents, they're all damn good.
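A minimal sketch of why the compressed axis makes these gaps look bigger than they are. It assumes the chart's y-axis starts at 70% (the transcript only says the 0 to 70% range is squeezed into a sliver) and uses the scores as quoted above, with 78 and 81 being the speaker's rounded figures. The helper name `bar_ratio` is hypothetical, chosen for illustration.

```python
# Scores as quoted in the episode (percent); 78 and 81 are the rounded figures.
# The 70% axis floor below is an assumption based on the description of the chart.
SCORES = {"Gemini 3.0 Pro": 76.2, "GPT-5.1 Codex Max": 78.0, "Opus 4.5": 81.0}


def bar_ratio(high: float, low: float, axis_floor: float) -> float:
    """How many times taller the `high` bar looks than the `low` bar
    when the chart's y-axis starts at `axis_floor` instead of 0."""
    return (high - axis_floor) / (low - axis_floor)


if __name__ == "__main__":
    top, bottom = SCORES["Opus 4.5"], SCORES["Gemini 3.0 Pro"]
    for floor in (0.0, 70.0):
        print(f"axis starts at {floor:.0f}%: Opus 4.5 bar looks "
              f"{bar_ratio(top, bottom, floor):.2f}x the Gemini 3.0 Pro bar")
    # The absolute gap in percentage points is the same either way.
    print(f"actual gap: {top - bottom:.1f} points")
```

On a full 0 to 100% axis the Opus 4.5 bar is only about 1.06 times the height of the Gemini 3.0 Pro bar; starting the axis at 70% makes it look roughly 1.77 times as tall, even though the real gap is the same 4.8 points.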
I'm not going to decide to use Codex Max just because it bested Gemini 3 Pro by a little over a point.
There are other factors that blend into that.
And a lot of them have to do with the personality and the style of the coding.
And Nate Jones has done a really nice comparison of Codex Max and GPT-3, I think it was.
He compared the two and showed that there are actual differences in the style and the approach those models take, not just in whether they solve the coding problems on the benchmark.
I have some numbers on that.