Andy Halliday
And it's a curated main set of questions for model evaluation.
So that's the GPQA.
This stat is also just for GPQA, not GPQA Diamond, which is a higher tier of the graduate-level Google-proof Q&A benchmark with a somewhat harder set of questions.
But the comparison across Llama, Gemma, and Qwen is going to be for that GPQA.
Okay.
Llama 3.2 does 17 on that benchmark, 17.
Gemma does 24.
Qwen 3, with double the size, roughly, does 35.
And LFM2 at 1.2 billion, LFM2.5 from Liquid AI, does 39.
OK, OK.
But on your phone, this is a pretty big jump in my view.
So now let's look at another important test, which is MMLU-Pro.
So that is the massive multitask language understanding benchmark.
And the Pro version of that was created as a harder set of those test questions, in effect.
It was set up because most of the models, even the smaller open-source models, were getting too good at the original MMLU.
So now MMLU-Pro.
So how do these compare now?
Okay.