Sholto Douglas

Speaker
1567 total appearances

Podcast Appearances

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, which means engaging with a task for many hours, or even many weeks or months. If I have, I don't know, an assistant or an employee or something, they can just do a thing I tell them for a while.

And AI agents haven't taken off for this reason, from what I understand.

So how linked are long context windows, and the ability to perform well on them, and the ability to do these kinds of long-horizon tasks that require you to engage with an assignment for many hours?

Or are these unrelated concepts?

And there are ways that you can find a smooth metric for that.

Yeah, HumanEval or whatever. In the GPT-4 paper, for the coding problems, they measure it by... log pass rate, right?

Exactly.

For the audience, the context on this is...

Basically, the idea is that when you're measuring how much progress there has been on a specific task, like solving coding problems, you upweight it when it gets it right only one in a thousand times. You don't give it a one-in-a-thousand score, because it's like, oh, it got it right some of the time.

And so the curve you see is like, it gets it right one in a thousand, then one in a hundred, then one in ten, and so forth.
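
As a rough illustration of why a log-scale score gives a smooth curve here, the sketch below compares raw pass rates with their logarithms for the one-in-a-thousand, one-in-a-hundred, one-in-ten progression described above. The helper name `log_pass_rate` and the use of the natural log are assumptions made for illustration; this is not claimed to be the exact metric used in the GPT-4 paper.

```python
import math


def log_pass_rate(pass_rate: float) -> float:
    """Natural log of a pass rate in (0, 1]. Hypothetical helper for illustration."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return math.log(pass_rate)


# The progression from the conversation: 1/1000, 1/100, 1/10, then always right.
pass_rates = [0.001, 0.01, 0.1, 1.0]
log_scores = [log_pass_rate(p) for p in pass_rates]

# Improvement between consecutive stages, measured on the log scale.
steps = [later - earlier for earlier, later in zip(log_scores, log_scores[1:])]

for rate, score in zip(pass_rates, log_scores):
    print(f"pass rate {rate:>6}: log score {score:.3f}")
print("log-scale improvement per stage:", [round(s, 3) for s in steps])
```

On the raw scale, the jump from 0.001 to 0.01 is almost invisible next to the jump from 0.1 to 1.0. On the log scale, each tenfold improvement counts the same, which is one way to credit a model for getting a problem right only occasionally.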

So actually, I want to follow up on this.

So if your claim is that AI agents haven't taken off because of reliability rather than long-horizon task performance, isn't the lack of reliability when a task is chained on top of another task, on top of another task, exactly the difficulty with long-horizon tasks?

Is it that you have to do 10 things in a row or 100 things in a row, and the reliability of any one of them diminishes, or the probability goes down from 99.99% to 99.9%.
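
The compounding effect in that question can be sketched with a few lines of arithmetic. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification the speakers do not state; the function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(step_reliability: float, num_steps: int) -> float:
    """Probability that all steps in a chain succeed.

    Assumes each step succeeds independently with the same probability
    (an illustrative simplification, not a claim from the conversation).
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_reliability ** num_steps


# Compare the two per-step reliabilities mentioned above across chain lengths.
for reliability in (0.9999, 0.999):
    for steps_in_chain in (10, 100, 1000):
        overall = chain_success_probability(reliability, steps_in_chain)
        print(f"per-step {reliability:.2%}, {steps_in_chain:>4} steps: {overall:.2%} overall")
```

Under these assumptions, a 100-step chain succeeds about 99% of the time at 99.99% per-step reliability, but only about 90% of the time at 99.9%, so a small drop per step becomes a large drop over a long task.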