Nick Heiner
There they are.
But that was giving you noise the whole way as you were getting up there.
So one thing we do at Surge is we try to have 100% correctness, 100% of tasks that actually work, instead of just accepting that degree of noise.
So that's probably my biggest recommendation for people trying to build their own eval sets. I think there's a certain temptation where it's like, building the eval set isn't fun.
Building the agent is what's fun.
Yeah.
But like, yeah, you shouldn't skip your vegetables.
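(As a concrete illustration of that "every task should actually work" bar, here's a minimal, hypothetical sketch of an eval-set sanity check. The task structure and grader are assumptions for illustration, not an actual Surge API.)

```python
# Hypothetical sketch: verify every eval task is well-formed by
# checking that its own reference answer passes its grader.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    reference_answer: str
    grade: Callable[[str], bool]  # returns True if an answer passes

def validate_eval_set(tasks: list[EvalTask]) -> list[int]:
    """Return indices of broken tasks: ones whose reference answer
    fails their own grader. A healthy eval set returns []."""
    broken = []
    for i, task in enumerate(tasks):
        if not task.grade(task.reference_answer):
            broken.append(i)
    return broken

tasks = [
    EvalTask(
        prompt="What is 2 + 2?",
        reference_answer="4",
        grade=lambda ans: ans.strip() == "4",
    ),
]
assert validate_eval_set(tasks) == []  # 100% of tasks should pass
```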
Yeah, I mean, they can be benchmarks, right?
Like at a high level, a benchmark is just a series of challenges for the model, plus a way to score them.
So RL environments are just a way to do that.
And yeah, in the fullness of time, do most benchmarks become RL environments?
I think it's certainly possible.
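(To make the "challenges plus scores" framing concrete, here's a hypothetical sketch of a single-turn benchmark item packaged as an RL-style environment. All names and the interface are illustrative assumptions.)

```python
# Hypothetical sketch: a benchmark item as an RL-style environment,
# where reset() poses the challenge and step() returns the score.

from typing import Callable

class QABenchmarkEnv:
    """A single-turn QA task: one step, reward 1.0 if correct."""

    def __init__(self, question: str, answer: str):
        self.question, self.answer = question, answer

    def reset(self) -> str:
        return self.question  # the challenge posed to the model

    def step(self, action: str) -> tuple[str, float, bool]:
        reward = 1.0 if action.strip() == self.answer else 0.0
        return "", reward, True  # episode ends after one step

def run_benchmark(envs: list[QABenchmarkEnv],
                  model: Callable[[str], str]) -> float:
    """Average reward across tasks: the benchmark score."""
    total = 0.0
    for env in envs:
        obs = env.reset()
        _, reward, _ = env.step(model(obs))
        total += reward
    return total / len(envs)

# Usage: a trivial stand-in "model" that returns a canned answer.
envs = [QABenchmarkEnv("Capital of France?", "Paris")]
print(run_benchmark(envs, lambda prompt: "Paris"))  # 1.0
```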
You know, it's sort of like in software development where you have your test pyramid, where at the bottom of the pyramid you have your unit tests, which are very fine-grained and give you very specific feedback.
And at the top of the pyramid, you have your integration tests, which test the whole system.
And the reason it's shaped like a pyramid is that the integration tests are much more expensive and slow to run.
And when something fails, you don't know exactly what the problem is necessarily.
But they're also way less brittle than the unit tests because they are tracking sort of closer to your end-to-end value.
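(A toy sketch of that pyramid tradeoff, with made-up functions: the unit test pinpoints failures but is tied to internals, while the end-to-end test survives refactors at the cost of vaguer feedback.)

```python
# Hypothetical example of the test-pyramid tradeoff.

def tokenize(s: str) -> list[str]:
    return s.split()

def word_count(s: str) -> int:
    return len(tokenize(s))

def test_tokenize_unit():
    # Unit test: cheap, pinpoints a failure to tokenize(),
    # but brittle if tokenization internals change.
    assert tokenize("a b") == ["a", "b"]

def test_word_count_integration():
    # Integration-style test: checks end-to-end behavior and
    # survives refactors, but a failure doesn't say which layer broke.
    assert word_count("the quick brown fox") == 4

test_tokenize_unit()
test_word_count_integration()
```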
And so I sort of see different benchmarks as having different spots in that pyramid, where, yeah, you need your RL environments to track, at the end of the day, can this thing be a lawyer?
But sometimes you want more specific benchmarks, like instruction following or groundedness, that will help you tease out, okay,
my latest model checkpoint had a big regression on the lawyer abilities, and it turns out it also had a big regression on the instruction following abilities, so that's probably where the problem is.
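(A hypothetical sketch of that triage step: compare two checkpoints across benchmarks at different pyramid levels and flag the co-occurring drops. The benchmark names and scores are made up for illustration.)

```python
# Hypothetical sketch: localize an end-to-end regression by diffing
# checkpoint scores across coarse and fine-grained benchmarks.

def diff_checkpoints(prev: dict[str, float], curr: dict[str, float],
                     threshold: float = 0.05) -> dict[str, float]:
    """Return benchmarks whose score dropped by more than `threshold`."""
    return {name: round(curr[name] - prev[name], 3)
            for name in prev
            if prev[name] - curr[name] > threshold}

prev = {"lawyer_e2e": 0.71, "instruction_following": 0.88, "groundedness": 0.90}
curr = {"lawyer_e2e": 0.58, "instruction_following": 0.70, "groundedness": 0.89}

# The end-to-end "lawyer" drop co-occurs with an instruction-following
# drop, pointing at where the regression likely originates.
print(diff_checkpoints(prev, curr))
# {'lawyer_e2e': -0.13, 'instruction_following': -0.18}
```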