Andy Halliday
This looks like a pretty logarithmic scale here.
But now look at Gemini 3.1 Pro Preview.
Not only does it jump above Claude Opus 4.6 and Claude Sonnet 4.6 on high, which were really in the sweet spot here, generating very high scores on ARC-AGI-2 at a relatively inexpensive rate.
Now Gemini 3.1 Pro Preview pushes the cost back up to $1 per task, and yet scores upwards of, you know, 75, 78%, somewhere in there.
So very impressive performance on ARC-AGI-2 by Gemini 3.1 Pro, which just came out yesterday.
So is it that, using a single model, you find a lack of reliability because of inconsistency in reproducing results?
Or are you talking about inconsistency across models, as in, oh, this model doesn't reliably produce the things that I need,
as compared to the way it apparently reliably produces solutions to the difficult tasks in the ARC-AGI prize.
You'd probably pay more attention, then, to the x-axis on that ARC-AGI chart, which asks: did it have to repeat that process, burning tokens repeatedly, in order to get to the right solution?
As in the new case of Gemini 3.1 Pro, where it only costs about $1 worth of tokens to do a task on ARC-AGI-2.
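The point about burning tokens repeatedly can be made concrete with a small sketch (this is my illustration, not something from the talk, and all the dollar figures and probabilities in it are made-up placeholders): if a model retries a task until it succeeds, the expected cost per solved task is the per-attempt cost divided by the per-attempt success probability.

```python
# Illustrative sketch: how repeated attempts inflate effective cost
# per task on a benchmark. Numbers are hypothetical placeholders.

def expected_cost_per_task(cost_per_attempt: float, p_success: float) -> float:
    """Expected spend per solved task if the model retries until it
    succeeds, with independent success probability p_success per try."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    # Geometric distribution: expected number of attempts is 1 / p_success.
    return cost_per_attempt / p_success

# A model that solves a task first try for $0.25, versus one that needs
# about four tries on average at the same per-attempt price:
print(expected_cost_per_task(0.25, 1.0))   # 0.25
print(expected_cost_per_task(0.25, 0.25))  # 1.0
```

Under those made-up numbers, two models with the same per-attempt price can differ fourfold in effective cost per task, which is exactly what a cost-per-task x-axis captures and a raw token price does not.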
One thing I wanted to point out is that prior graphic I showed is from Artificial Analysis, and Artificial Analysis is a company, right?
They run these tests against all of these AI models.
Artificial Analysis, on that scale