Jaeden Schaefer
A lot of the leaderboards where people are like, look, basically anytime these AI companies can test their own model on a benchmark, it feels like they're cheating, being scammy in some way. And I don't mean to call the kettle black, but I feel like Anthropic, Google, and OpenAI have all been caught doing some form of this over the last few years. So I don't really put as much stock in those screenshots where they're like, we scored a 72 on this exam, and then it turns out they skipped a couple of the questions it probably didn't do well on.
Anyways, I'm not saying this is Google, but there is an AI company that has done this.
And so when it comes to these companies testing themselves, I trust them a lot less than the real world leaderboards.
So some of those examples are when, essentially, they have side-by-side comparisons of their model versus another model, and they have people vote blindly on which response they like better.
And when a new model really starts crushing it on those types of leaderboards, I take stock in that, because this is actual people blind-testing and saying that the model is better.
So that's great.
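To make that concrete, here's a minimal sketch of how blind side-by-side votes can be turned into a ranking. It assumes an Elo-style rating update, which arena-style leaderboards commonly use; the function names, starting ratings, and K-factor are illustrative, not any particular leaderboard's actual implementation.

```python
# Minimal sketch: turning blind side-by-side votes into a leaderboard ranking.
# Assumes an Elo-style update; values and names here are illustrative only.

K = 32  # how strongly a single vote moves a model's rating

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one blind vote: the voter preferred `winner`'s response."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - exp_win)
    ratings[loser] -= K * (1 - exp_win)

# Example: three blind votes between two anonymized models.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for vote in ["model_a", "model_a", "model_b"]:
    loser = "model_b" if vote == "model_a" else "model_a"
    update(ratings, vote, loser)
print(ratings)  # model_a ends up slightly ahead after winning 2 of 3 votes
```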
One of these kind of real-world leaderboards comes from a company called Mercor. Their CEO, Brendan Foody, was posting about this. He says that Gemini 3.1 Pro is now the number one model on their leaderboard, which they call the Apex Agents Leaderboard.
It's basically a benchmark that is designed to measure how well these AI systems handle professional knowledge-based tasks.
And he says this basically just shows how quickly these models can move into a lot of the systems that agents are using to improve real work.
So what's interesting to me is it feels like Google is putting a lot of stock into this field of knowledge-based tasks.
They're doing a lot with education.
And it seems like it's paying off in the benchmarks.
With this whole release, the competition with OpenAI and Anthropic is obviously really heating up.
Everyone's rolling out new systems and it feels like they're only months apart.