Nick Heiner

Speaker
529 total appearances

Podcast Appearances

The Neuron: AI Explained
Inside the Secret Labs Where AI Learns to Work

There they are.

But that was giving you noise the whole way you were getting up there.

So one thing we do at Surge is we try to have 100% correctness, 100% tasks that actually work instead of just accepting this degree of noise.

So that's probably my biggest recommendation for people trying to build their own eval sets: I think there's a certain temptation where it's like, building the eval set isn't fun.

Building the agent is what's fun.

Yeah.

But, yeah, you shouldn't skip your vegetables.

Yeah, I mean, they can be benchmarks, right?

Like at a high level, a benchmark is just a series of challenges for the model and scores.
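
As a rough illustration of that definition, here is a minimal Python sketch of a benchmark as a list of challenges plus a score. The names `Challenge`, `run_benchmark`, and `toy_model` are hypothetical and do not come from the episode, and the "model" is a toy stand-in rather than a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Challenge:
    """One benchmark item: a prompt and a pass/fail check (hypothetical structure)."""
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer passes


def run_benchmark(model: Callable[[str], str], challenges: list[Challenge]) -> float:
    """Run every challenge through the model and return the fraction passed."""
    if not challenges:
        raise ValueError("a benchmark needs at least one challenge")
    passed = sum(1 for c in challenges if c.check(model(c.prompt)))
    return passed / len(challenges)


def toy_model(prompt: str) -> str:
    """Toy stand-in for a model: it only knows one answer."""
    return "4" if prompt == "What is 2 + 2?" else "unknown"


challenges = [
    Challenge("What is 2 + 2?", lambda a: a.strip() == "4"),
    Challenge("What is the capital of France?", lambda a: "paris" in a.lower()),
]

score = run_benchmark(toy_model, challenges)
print(score)
```

The toy model passes one of the two challenges, so the score is the pass rate over the set.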

So RL environments are just a way to do that.

And yeah, in the fullness of time, do most benchmarks become RL environments?

I think it's certainly possible.

You know, it's sort of like in software development where you have your test pyramid, where at the bottom of the pyramid, you have your unit tests, which are very fine-grained and give you very specific feedback.

And at the top of the pyramid, you have your integration tests, which test the whole system.

And the reason it's shaped like a pyramid is that the integration tests are much more expensive and slow to run.

And when something fails, you don't know exactly what the problem is necessarily.

But they're also way less brittle than the unit tests because they are tracking sort of closer to your end-to-end value.

And so I sort of see different benchmarks as having different spots in that pyramid where like, yeah, you need your RL environments to sort of track like, okay, end of the day, can this thing be a lawyer?

But sometimes you want more specific benchmarks, like instruction following or groundedness, that will help you sort of tease out, like, okay, my latest model checkpoint had a big regression on the lawyer abilities and it had a big regression on the instruction following abilities.
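
One way to picture the checkpoint comparison described here is the small Python sketch below. It assumes each benchmark yields a single score between 0 and 1 per checkpoint. The function name `find_regressions`, the benchmark names, the threshold, and every number are made up for illustration and are not from the episode.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.05,
) -> dict[str, float]:
    """Return benchmarks whose score dropped by more than `threshold`.

    The value for each benchmark is the size of the drop, rounded for display.
    """
    return {
        name: round(previous[name] - current[name], 4)
        for name in previous
        if name in current and previous[name] - current[name] > threshold
    }


# Hypothetical scores for two checkpoints (illustrative numbers only).
previous = {
    "lawyer_end_to_end": 0.72,      # broad, end-to-end style benchmark
    "instruction_following": 0.90,  # narrower, more specific benchmark
    "groundedness": 0.85,           # narrower, more specific benchmark
}
current = {
    "lawyer_end_to_end": 0.55,
    "instruction_following": 0.70,
    "groundedness": 0.84,
}

regressions = find_regressions(previous, current)
print(regressions)
```

In this made-up data the end-to-end score and the instruction-following score both drop while groundedness holds steady, which is the kind of pattern the narrower benchmarks are meant to surface.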