Sholto Douglas
OK.
So I think the biggest thing that's changed is that RL in language models has finally worked.
And this is manifested in the fact that we finally have proof of an algorithm that can give us expert human reliability and performance given the right feedback loop.
And so I think this has only really been conclusively demonstrated in competitive programming and math, basically.
And so if you think of these two axes, one is the intellectual complexity of the task, and the other is the time horizon over which the task is being completed.
And I think we have proof that we can reach the peaks of intellectual complexity along many dimensions.
But we haven't yet demonstrated long-running agentic performance.
And you're seeing the first stumbling steps of that now and should see much more conclusive evidence of that basically by the end of the year with real software engineering agents doing real work.
And I think, Trenton, you're like,
I think this is roughly on track for where I expected with software engineering.
I think I expected them to be a little bit better at computer use.
Yeah.
But I understand all the reasons for why that is, and I think that's well on track to be solved.
It's just a temporary lapse.
And holding me accountable for my predictions next year, I really do think end of this year, sort of like this time next year, we have software engineering agents that can do close to a day's worth of work for a junior engineer or a couple of hours of quite competent independent work.
I think my description there was, in retrospect, probably not what's limiting them.
I think what we're seeing now is closer to lack of context, lack of ability to do complex, very multi-file changes, and maybe scope of the change or scope of the task in some respects.
They can cope with high intellectual complexity in a focused context with a really scoped problem.
But when something's a bit more amorphous or requires a lot of discovery and iteration with the environment, this kind of stuff, they struggle more.