Kevin Weil

OpenAI’s CPO on what’s coming next: Hardware, GPT-5, Jony Ive, agents, more

So one way to think of evals is as a way to measure capabilities and intelligence of models on different dimensions.

So you can have evals around, you know, how good it is at solving USAMO-style Math Olympiad problems, another around how good it is at chemistry, and another one about how good it is at creative writing.

Azeem Azhar's Exponential View

Some, and then also when we're building specific products, I think one of the most effective ways to build products is to take the skill that you want the model to have in order to meet the product need, turn that into an eval so you can actually understand how good you are at it and also how you're getting better over time.
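
[Editorial sketch, not from the episode.] One way to picture "turn the skill into an eval" is a fixed set of prompt and expected-answer pairs, scored the same way for every model version so the numbers can be compared over time. The names `EvalCase`, `run_eval`, `model_v1`, and `model_v2` are hypothetical, and the toy "models" are plain lookup functions standing in for real ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the model's answer matches exactly."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    correct = sum(model(c.prompt).strip() == c.expected for c in cases)
    return correct / len(cases)


# Hypothetical stand-ins for two versions of a model.
def model_v1(prompt: str) -> str:
    return {"2 + 2": "4"}.get(prompt, "unknown")


def model_v2(prompt: str) -> str:
    return {"2 + 2": "4", "3 * 5": "15"}.get(prompt, "unknown")


CASES = [
    EvalCase("2 + 2", "4"),
    EvalCase("3 * 5", "15"),
    EvalCase("10 / 4", "2.5"),
]

# Same cases, same scoring rule, different model versions.
scores = {name: run_eval(m, CASES) for name, m in [("v1", model_v1), ("v2", model_v2)]}
```

Because the cases and the scoring rule stay fixed, a change in `scores` between versions reflects the model rather than the measurement.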

But one of the fascinating things is the evals that we all used a year ago to measure models,

they're all very kind of cut and dried.

Like you're testing against math.

And with math, there's a right answer.

You can talk about creative writing evals though.

And with creative writing, there's no answer.

So how do you grade that, right?

That's one problem.

The other is like, as you start to take on more complex tasks, you're not just answering questions, you're actually trying to automate some multi-step workflow, there may be ambiguity in the right way to do that.

If I'm an AI booking a flight for you, there's not a single way to grade which flight is correct, you know.
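
[Editorial sketch, not from the episode.] One possible way to grade a task like this, offered only as an assumption about how it could be done, is to check the outcome against the traveler's stated requirements instead of comparing it to one reference flight. The `Flight` and `Request` types and the three checks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    price_usd: float
    stops: int


@dataclass(frozen=True)
class Request:
    origin: str
    destination: str
    max_price_usd: float
    max_stops: int


def grade_booking(request: Request, booked: Flight) -> dict[str, bool]:
    """Check each requirement separately instead of matching one 'right' flight."""
    return {
        "route": (booked.origin, booked.destination)
        == (request.origin, request.destination),
        "price": booked.price_usd <= request.max_price_usd,
        "stops": booked.stops <= request.max_stops,
    }


def passed(checks: dict[str, bool]) -> bool:
    """A booking passes only if every individual requirement is met."""
    return all(checks.values())
```

Under this scheme a cheaper one-stop flight and a pricier nonstop can both pass, which is the point: several different bookings may be acceptable. Returning per-check results also shows which requirement a failed booking missed.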

You also get into these really interesting, challenging, subjective ways of how do we actually grade this particular task?

And part of having an eval, if you want to at least automate it, is you need to also have a grader for it so that you can very quickly understand how you're doing on that eval.
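
[Editorial sketch, not from the episode.] To make the "eval plus grader" pairing concrete, the grader can be treated as a swappable function that maps a prompt and an output to a score between 0 and 1. Everything here is hypothetical: `exact_match_grader` covers tasks with a right answer, and `rubric_grader` is a deliberately crude stand-in for a subjective grader, which in practice might be a human rater or another model working from a written rubric.

```python
from statistics import mean
from typing import Callable

# A grader maps (prompt, model_output) to a score between 0.0 and 1.0.
Grader = Callable[[str, str], float]


def exact_match_grader(expected: dict[str, str]) -> Grader:
    """Grader for tasks with one right answer, such as math."""
    def grade(prompt: str, output: str) -> float:
        return 1.0 if output.strip() == expected[prompt] else 0.0
    return grade


def rubric_grader(required_words: list[str], max_words: int) -> Grader:
    """Toy rubric: credit for each required word plus a length limit."""
    def grade(prompt: str, output: str) -> float:
        text = output.lower()
        checks = [w.lower() in text for w in required_words]
        checks.append(len(output.split()) <= max_words)
        return sum(checks) / len(checks)
    return grade


def score_outputs(outputs: dict[str, str], grader: Grader) -> float:
    """Average the grader's score over every (prompt, output) pair."""
    return mean(grader(p, o) for p, o in outputs.items())
```

Keeping the grader separate from the scoring loop means the same eval harness can run quickly and automatically whether the task has a single right answer or only a rubric.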

So it is interesting.

One of the skills that I think is going to be more and more important for PMs over time is the ability to actually create evals for the products that you're building.

I mean, actually, more than people realize, I think.
