Andy Halliday
and regular Opus 4.6 at 64.
So that's a pretty big spread back to Gemini 3.1 Pro Preview.
What does that mean?
For most of us, probably not a lot, unless you're actually using the entire Google ecosystem to set up a harness for multi-agent coding or multi-agent workflow management.
You might want to not focus entirely on Gemini 3.1 Pro Preview.
Make Opus available to whatever system you're architecting.
Now, a lot of it depends on memory and the ability to do recursion, where you process things while capturing the intermediate results, and then open up the context window again with new inference based on that new context, and so on.
So that's this kind of iterative looping process using a Python REPL...
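As a rough illustration of the looping process described above, here is a minimal Python sketch. It assumes a hypothetical `fake_model` function standing in for a real model call, and a hypothetical `iterative_loop` helper; neither comes from the conversation. Each pass captures an intermediate result, then builds a fresh context from those captured notes plus the next piece of input.

```python
from typing import Callable, List


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it only reports the prompt size."""
    return f"summary({len(prompt)} chars)"


def iterative_loop(chunks: List[str], model: Callable[[str], str]) -> List[str]:
    """Process chunks one at a time, carrying forward only the captured notes."""
    notes: List[str] = []  # intermediate capture of each pass's result
    for chunk in chunks:
        # Open a fresh context: the notes so far plus the next chunk,
        # rather than the full history of everything processed.
        context = "\n".join(notes) + "\n" + chunk
        notes.append(model(context))
    return notes


notes = iterative_loop(["alpha", "beta", "gamma"], fake_model)
print(notes)
```

In a real setup the stub would be replaced by an actual inference call, and the notes would likely be summarized or pruned so the rebuilt context stays small.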
Let me share, whenever a new model comes out, there's a couple of different benchmarks I go immediately to.
One is the Artificial Analysis index.
And the other one is ARC-AGI-2.
And this is the thing that's measuring where these players are in the progression towards an effective accomplishment of artificial general intelligence.
And these are really complex reasoning tasks that make it hard for any of the models to really get significantly above 60%, 70%, which is where you see the players were just recently.
Gemini 3 Deep Think came out on February 20th, so still earlier this month, but it's already at a high expense level.
This x-axis down here is cost per AGI...
And you can see that GPT-5.2 was spending upwards of, you know, $70 per task.
And Gemini 3 Deep Think was spending, you know, something on the order of, let's say, $20 a task.
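To put those two per-task figures side by side, here is a small Python sketch. The dollar amounts are the approximate values quoted above ("upwards of $70" and "on the order of $20"), and the 100-task run size is a purely hypothetical number chosen for illustration.

```python
# Approximate per-task costs as quoted in the conversation (not exact figures).
gpt_cost_per_task = 70.0
deep_think_cost_per_task = 20.0

# How many times more expensive the first figure is than the second.
ratio = gpt_cost_per_task / deep_think_cost_per_task


def total_cost(cost_per_task: float, n_tasks: int) -> float:
    """Total spend for a run of n_tasks at a flat per-task cost."""
    return cost_per_task * n_tasks


n_tasks = 100  # hypothetical run size, for illustration only
print(f"ratio: {ratio:.1f}x")
print(f"{n_tasks} tasks: ${total_cost(gpt_cost_per_task, n_tasks):,.0f} "
      f"vs ${total_cost(deep_think_cost_per_task, n_tasks):,.0f}")
```

At those rough figures the gap is about 3.5x per task, which adds up quickly over a full benchmark run.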