Nick Heiner
Yeah.
So essentially the structure of an eval is a set of golden answers: you have tasks, and then you have the expected outcome for each.
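A golden set like that can be sketched as a list of task/expected-outcome pairs with a simple exact-match scorer. This is a minimal illustration, not any particular eval framework; all the names here are made up:

```python
# A minimal sketch of an eval: a golden set of tasks paired with
# expected outcomes, scored by exact match. Names are illustrative.
golden_set = [
    {"task": "What is 2 + 2?", "expected": "4"},
    {"task": "What is the capital of France?", "expected": "Paris"},
]

def run_eval(model_fn, golden_set):
    """Score a model against the golden set; returns fraction correct."""
    correct = sum(
        1 for item in golden_set
        if model_fn(item["task"]).strip() == item["expected"]
    )
    return correct / len(golden_set)

# A toy "model" that answers "4" to everything scores 0.5 here.
score = run_eval(lambda task: "4", golden_set)
```

Real evals replace exact match with fuzzier grading (model-graded rubrics, unit tests, etc.), which is exactly where open-ended tasks get hard.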
And as we were talking about earlier, the more interesting the task, the harder it is to construct that.
Yeah.
Because the more open-ended the evaluation is, the harder it gets. And yeah, that does become substantially difficult.
And frankly, like many, many golden sets are wrong.
Right.
It just takes a lot of effort.
And again, if you have noise, that really disrupts your development process.
You know, you'll see when labs release new models, they talk about their benchmark scores.
And sometimes you'll see that for a certain benchmark, everything will cluster around 80%.
And people will start to say, oh, the benchmark is saturated now.
And when they say saturated, what they mean is there's nothing more for us to learn.
Like the model is sort of as good as it's going to get.
And sometimes it's because the benchmark has a long tail of really hard things.
But sometimes it's because a lot of the tasks are just broken.
And you start with the benchmark and you're like, okay, I expect 20% of these are busted.
I just don't know which 20.
And then you train your model, you get to 80% and you're like, oh, those are the 20.
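The arithmetic behind that plateau can be sketched in a couple of lines. This is just a toy illustration of the reasoning, with hypothetical numbers:

```python
# Toy illustration: if some fraction of a benchmark's golden answers are
# wrong ("busted"), a model that solves every solvable task still plateaus
# below 100%, because the broken tasks can never be scored as correct.
def score_ceiling(total_tasks, broken_fraction):
    """Max achievable accuracy when broken tasks always score as wrong."""
    solvable = total_tasks * (1 - broken_fraction)
    return solvable / total_tasks

# With 20% of tasks busted, scores cluster near 80% no matter how good
# the model gets -- which can look like "saturation" from the outside.
ceiling = score_ceiling(1000, 0.20)
```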