Logan Kilpatrick
single-digit numbers and months.
Yeah, all evals have this expiration date.
And that's actually what's interesting about these online arena type of evals, which is that what you're actually comparing is models versus one another.
So as model progress continues, the puck is always moving.
So relative to static evals, it becomes really hard.
Eventually you'd imagine that some of the arenas would saturate and you'd get to this, like, global optimum where it's not interesting anymore, but I think they have a much longer lifespan than static evals.
So there's something really cool there.
It's also meant to show, and this was part of the conversation with Demis, I don't know if this part was on camera or not, but we were talking off camera and it was about
the, you know, people talk a lot about like how close we are to AGI and all this other stuff.
And obviously there's been, like, an incredible amount of progress on the model side and they're all getting smart and things like that.
But then you look at these examples where, with domain-specific systems and with humans,
we're really easily able to generalize across all these different games.
And you look at the models and the models can't follow the basic instructions of chess.
The models want to make all these illegal moves.
They want to do all this stuff that just isn't
how you actually play the game, and it's a good reminder that, like, we're obviously making progress, but we're not at AGI yet.
Um, right.
And there's this, like, jagged intelligence paradigm where the models can generate the code for Minecraft, uh, or something that looks or feels like Minecraft.
And yet at the same time, can't follow the basic instructions of chess, which I could probably teach a, you know, a 12 year old who's