Kevin Weil

👤 Person
437 total appearances

Podcast Appearances

Azeem Azhar's Exponential View
OpenAI’s CPO on what’s coming next: Hardware, GPT-5, Jony Ive, agents, more

So one way to think of evals is as a way to measure capabilities and intelligence of models on different dimensions.

So you can have evals around how good it is at solving USAMO-style Math Olympiad problems, another around how good it is at chemistry, and another around how good it is at creative writing.

And then also, when we're building specific products, I think one of the most effective ways to build products is to take the skill that you want the model to have in order to meet the product need and turn that into an eval, so you can actually understand how good you are at it and also how you're getting better over time.

But one of the fascinating things is the evals that we all used a year ago to measure models, they're all very kind of cut and dried.

Like you're testing against math.

And with math, there's a right answer.

You can talk about creative writing evals though.

And with creative writing, there's no answer.

So how do you grade that, right?

That's one problem.

The other is, as you start to take on more complex tasks, you're not just answering questions, you're actually trying to automate some multi-step workflow, and there may be ambiguity in the right way to do that.

If I'm an AI booking a flight for you, there's not a single way to grade which flight is the correct one, you know.

You also get into these really interesting, challenging, subjective questions of how we actually grade this particular task.

And part of having an eval, if you want to at least automate it, is you need to also have a grader for it so that you can very quickly understand how you're doing on that eval.
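
As a rough illustration of the eval-plus-grader idea described above, here is a minimal Python sketch. It is not taken from the episode and is not how any particular lab implements evals. The names `EvalCase`, `exact_match_grader`, `run_eval`, and `toy_model` are all hypothetical, and the exact-match grader only suits tasks like math where there is a single right answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One eval item: a prompt plus the answer we expect (hypothetical structure)."""
    prompt: str
    expected: str


def exact_match_grader(output: str, expected: str) -> float:
    """Return 1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    grader: Callable[[str, str], float] = exact_match_grader,
) -> float:
    """Run the model on every case and return the grader's average score."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    total = sum(grader(model(case.prompt), case.expected) for case in cases)
    return total / len(cases)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a real system would call a model here)."""
    answers = {"What is 2 + 2?": "4", "What is 3 * 5?": "15"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is 3 * 5?", "15"),
    EvalCase("What is 7 - 9?", "-2"),
]

# The toy model answers two of the three cases correctly.
score = run_eval(toy_model, CASES)
print(f"eval score: {score:.2f}")
```

Because the grader is passed in as a function, the same `run_eval` loop could take a different grader for tasks without a single right answer, such as creative writing or flight booking. Designing that grader is the hard part the discussion points to.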

So it is interesting.

One of the skills that I think is going to be more and more important for PMs over time is the ability to actually create evals for the products that you're building.

I mean, actually, more than people realize, I think.