Logan Kilpatrick
Podcast Appearances
There were a couple of different dimensions to the goal for Game Arena.
One, it's just cool to see models play games.
And there were lots of people like Magnus Carlsen and others commentating and watching all the chess games specifically that were happening, which is cool.
There's definitely an entertainment value to it. But two, there's this thread around evals. For folks who haven't built evals or aren't following this closely, the challenge with evals right now is that saturation happens so quickly.
Even, you know, Humanity's Last Exam, which I won't talk deeply about in terms of what it's actually testing, was designed to be a really rigorous and difficult eval that would take models a long time to actually solve.
And I think already today you're seeing models jump from like zero or one percent to 40 or 50 percent on the order of a single-digit number of months.
Yeah, all evals have this expiration date.
And that's actually what's interesting about these online arena-type evals, which is that what you're actually comparing is models versus one another.
So as model progress continues, the puck is always moving.
So relative to static evals, it becomes really hard to saturate.
Eventually you'd imagine that some of the arenas would saturate, and you'd get to this global optimum where it's not interesting anymore, but I think they have a much longer lifespan than static evals.
So there's something really cool there.
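To make the "models versus one another" idea concrete, here is a minimal Python sketch of an Elo-style rating update, the kind of relative scoring often used for head-to-head comparisons. Elo is an assumption for illustration only; the conversation does not say how Game Arena actually scores models, and the model names, starting ratings, and K-factor below are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (hypothetical value) controls how far one result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical models and results: (player A, player B, score for A).
ratings = {"model_a": 1500.0, "model_b": 1500.0}
results = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.5),
    ("model_a", "model_b", 1.0),
]

for a, b, score_a in results:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Because each rating is defined only relative to the other players, a stronger new model shifts the whole scale rather than hitting a fixed ceiling, which is the property that gives arena-style evals a longer lifespan than a static benchmark.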
It's also meant to show, and this was part of the conversation with Demis, I don't know if this part was on camera or not, but we were talking off camera, and it was about the, you know, people talk a lot about how close we are to AGI and all this other stuff.
And obviously there's been an incredible amount of progress on the model side, and they're all getting smarter and things like that.
But then you look at these examples where, with domain-specific systems and with humans, we're really easily able to generalize across all these different games.