One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, meaning engaging with a task for many hours, or even many weeks or months, the way an assistant or an employee can just be told to do something and keep at it for a while.
And AI agents haven't taken off for this reason, from what I understand.
So how linked are long context windows, and the ability to perform well over them, and the ability to do these kinds of long-horizon tasks that require you to engage with an assignment for many hours?
Or are these unrelated concepts?
And there are ways that you can find a smooth metric for that.
Yeah, HumanEval or whatever. In the GPT-4 paper, for the coding problems, they measure it by... log pass rate, right?
Exactly.
For the audience, the context on this is...
Basically, the idea is that when you're measuring how much progress there has been on a specific task, like solving coding problems, you upweight the cases where the model gets it right only one in a thousand times. You don't just give it a one-in-a-thousand score, because, hey, it got it right some of the time. And so the curve you see is that it gets it right one in a thousand times, then one in a hundred, then one in ten, and so forth.
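For a rough sense of what a metric like that does, here is a minimal Python sketch of a mean-log-pass-rate style score. The function name, the sampling setup, and the clamping of zero pass rates are illustrative assumptions, not the GPT-4 paper's exact procedure.

```python
import math

def mean_log_pass_rate(samples_per_problem):
    # samples_per_problem: one list of booleans per coding problem, each
    # boolean saying whether that sampled solution passed the unit tests.
    logs = []
    for passes in samples_per_problem:
        n = len(passes)
        # Estimated pass rate for this problem, clamped away from zero so the
        # log stays finite when every sample failed (an assumption here, not
        # necessarily what the GPT-4 paper does).
        p = max(sum(passes) / n, 1.0 / (10 * n))
        logs.append(math.log(p))
    return sum(logs) / len(logs)

# Why the log matters: a problem solved 1-in-1000 times registers smooth
# progress toward 1-in-100 and 1-in-10, instead of rounding to "always wrong".
for p in (0.001, 0.01, 0.1, 0.5):
    print(f"pass rate {p:g} -> log pass rate {math.log(p):.2f}")
```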
So actually, I want to follow up on this.
So if your claim is that AI agents haven't taken off because of reliability rather than long-horizon task performance, isn't the lack of reliability when a task is chained on top of another task, on top of another task, exactly the difficulty with long-horizon tasks? Is it that you have to do 10 things in a row, or 100 things in a row, and any drop in the reliability of any one of them compounds? Or that the probability of success on each one going down from 99.99% to 99.9% matters across the whole chain?
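To put numbers on the compounding intuition in that question, here is a tiny Python sketch, assuming each step succeeds independently, showing why a drop in per-step reliability from 99.99% to 99.9% starts to hurt once you chain 100 or 1000 steps. The step counts and reliability values are illustrative, not taken from the discussion.

```python
# Back-of-the-envelope compounding: if each step in a chained task succeeds
# independently with probability p, the whole chain succeeds with p ** n.
for per_step in (0.9999, 0.999, 0.99):
    for steps in (10, 100, 1000):
        overall = per_step ** steps
        print(f"per-step {per_step:.4f} over {steps:>4} steps -> "
              f"chance the whole chain works: {overall:.3f}")
```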