Jaeden Schafer
Those are the kind of nice things you'd see in, for example, a GPT 5.2, like when that calculator feature came out. And what's nice is that when GPT 5.3 comes out, or even GPT 6, that calculator tool gets built in.
So what's exciting to me is that when they make these incremental updates, the 3.1, 3.2, 3.3, all of the little features and nice-to-haves they're building in are going to carry over when the whole model gets an entire overhaul.
So that's what I'm excited about.
They were sharing a bunch of results from some independent evaluations and benchmarks, especially Humanity's Last Exam.
It feels like the AI models beat a lot of the older benchmarks; they basically weren't hard enough or built well enough to keep challenging the models.
And so now we've come up with some more challenging ones.
One of the more challenging ones is Humanity's Last Exam.
Gemini 3.1 Pro outperformed Gemini 3.
I mean, obviously, if it didn't, I don't think they'd be releasing it to us, but it did it by a huge margin.
The model is also climbing a bunch of real-world performance leaderboards.
This is what I think is actually the most important.
Because basically, anytime these AI companies can test their own model on a benchmark, it feels like there's room for them to cheat, to be scammy in some way.
And I don't mean to be the pot calling the kettle black, but I feel like Anthropic, Google, and OpenAI have all been caught doing some form of this over the last few years.
So I don't put as much stock in those self-reported numbers, because you'll see those screenshots where they're like, we scored 72% on this exam, and it turns out they skipped a couple of the questions the model probably wouldn't have done well on.
Anyways, I'm not saying this was Google, but there is an AI company that has done this.
And so when it comes to these companies testing themselves, I trust them a lot less than the real world leaderboards.