Kevin Weil
So one way to think of evals is as a way to measure capabilities and intelligence of models on different dimensions.
So you can have evals around, you know, how good it is at like solving USAMO Math Olympiad style problems and another around how good it is at chemistry and another one about how good it is at creative writing.
And then also, when we're building specific products, I think one of the most effective ways to build products is to take the skill that you want the model to have in order to meet the product need and turn that into an eval, so you can actually understand how good you are at it and also how you're getting better over time.
But one of the fascinating things is the evals that we all used a year ago to measure models, they're all very kind of cut and dried.
Like you're testing against math.
And with math, there's a right answer.
You can talk about creative writing evals though.
And with creative writing, there's no right answer.
So how do you grade that, right?
That's one problem.
The other is like, as you start to take on more complex tasks, you're not just answering questions. You're actually trying to automate some multi-step workflow, and there may be ambiguity in the right way to do that.
If I'm an AI booking a flight for you, there's not a single correct flight to grade against, you know.
You also get into these really interesting, challenging, subjective ways of how do we actually grade this particular task?
And part of having an eval, if you want to at least automate it, is you need to also have a grader for it so that you can very quickly understand how you're doing on that eval.
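[Editor's sketch, not part of the conversation: the pairing of an eval with an automated grader might look roughly like the following for the cut-and-dried case where there is one right answer. Every name here (`EvalCase`, `exact_match_grader`, `run_eval`, `toy_model`) is hypothetical and chosen for illustration; `toy_model` is a stand-in for a real model call. Subjective tasks such as creative writing would need a different grader, which this sketch does not attempt.]

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # reference answer; only meaningful when one right answer exists


def exact_match_grader(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grader: Callable[[str, str], float],
) -> float:
    """Return the mean grader score of `model` over all `cases`."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [grader(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model; gets one answer wrong on purpose."""
    answers = {"What is 2 + 2?": "4", "What is 7 * 6?": "41"}
    return answers.get(prompt, "")


cases = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is 7 * 6?", "42"),
]
score = run_eval(toy_model, cases, exact_match_grader)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 0.50"
```

[Because the grader is just a function passed into `run_eval`, the same cases could be re-run against each new model version to see how the score moves over time, or graded with a different function when exact match is too strict.]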
So it is interesting.
One of the skills that I think is going to be more and more important for PMs over time is the ability to actually create evals for the products that you're building.
I mean, actually, more than people realize, I think.