Zach Lipton
subset of the population? Are you sure that it works with the same accuracy over here?
With something in the generative AI space, evaluation is a little bit more like evaluating the outputs of labor.
For example, suppose I asked you about a particular young journalist at a firm: how accurate are they? You'd run into a whole set of conceptually fraught questions. I could try to say, of all the claims they make, what fraction are correct.
But then, is there even an unambiguous way of mapping from an article to a collection of claims?
So the gap between the construct and the measurement is much larger, maybe, than it would be for hot dog/not hot dog or for a blood test.
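To make the claim-level framing concrete, here is a minimal sketch (not Abridge's actual pipeline): decompose an output into atomic claims, judge each one, and report the fraction judged correct. The claim list and the verdict table below are hypothetical stand-ins; in practice both the decomposition and the judging are modeling problems in their own right.

```python
def claim_accuracy(claims: list[str], is_correct) -> float:
    """Fraction of extracted claims judged correct by `is_correct`."""
    if not claims:
        return 0.0  # no claims extracted: nothing to score
    return sum(1 for c in claims if is_correct(c)) / len(claims)


# Toy usage: a fixed verdict table plays the role of a human or model judge.
verdicts = {
    "The patient reported chest pain.": True,
    "The pain began three weeks ago.": False,  # suppose the transcript said three days
}
score = claim_accuracy(list(verdicts), verdicts.get)
print(score)  # 0.5
```

The ambiguity Lipton describes lives entirely outside this function: which claims you extract, and what counts as "correct" for each, is where the construct/measurement gap opens up.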
For us, it's very similar.
So this is the pursuit of an entire team within Abridge.
And you'll also see, within the big labs like Anthropic or OpenAI, entire teams devoted to this. I shouldn't say just accuracy, because accuracy has a literal meaning: it's a simple metric for a predictive system. They're devoted more broadly to evaluation.
And when we evaluate, what we have is not just a single metric.
What we have is a broad sensorium.
The leaderboard doesn't have one column.
It has 50 columns.
And contributing to that are all kinds of measures.
We care about, for every pertinent medical detail present in the transcript, what fraction is captured in the note.
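That detail-capture measure is one column of the many on the leaderboard. A minimal sketch of how such a column could be computed, assuming details have already been annotated as short strings (the matching here is exact set membership; a real system would need fuzzier matching, and these example details are invented):

```python
def detail_recall(transcript_details: set[str], note_details: set[str]) -> float:
    """Of the pertinent details annotated in the transcript,
    what fraction appear in the generated note?"""
    if not transcript_details:
        return 1.0  # nothing to capture, so nothing was missed
    return len(transcript_details & note_details) / len(transcript_details)


# Hypothetical annotations for one encounter.
transcript_details = {"chest pain", "onset three days ago", "no fever"}
note_details = {"chest pain", "no fever"}
print(detail_recall(transcript_details, note_details))  # 2 of 3 captured
```

Each of the leaderboard's other columns would be a different function of the same transcript/note pair, which is what makes the whole thing a "sensorium" rather than a single score.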