Cal Newport
So you needed the arrival of mature software development tools built on LLMs before you could even do these tasks.
So basically what this graph is showing is that the very first coding harnesses, introduced in the fall of 2025, couldn't solve the hardest types of problems.
And then they sort of fixed that the next year.
All right.
There's one final chart in this paper that they're pointing at to justify their concerns.
I'll put it on the screen here.
The title is "Where Researcher Went Wrong, Could Claude Have Done Better?"
And we see here a bunch of different Claude models, each with a percentage bar.
Back with the early Claude models, you know, we were getting around 45 or 50%.
And now with the very newest versions of Claude, like Opus 4.7 and Claude Mythos, we're at around 59 to 64%.
So we've got something like a 10 to 15 percentage point improvement on that measure.
What is this measure?
It's a little bit complicated.
Essentially, they have these transcripts of programmers working on programming tasks.
And what they're looking for is an example where there's some problem that the programmer is trying to solve and they take a wrong turn.
So they go down some path to explore something that turns out not to be the correct source of the problem.
So what they would do is take the transcript of this session right up to the point where the human was about to go down the wrong path.
They would feed that truncated transcript to one of these mature coding harnesses built on top of an LLM and ask, hey, what do you think we should do next?
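To make the shape of that measure concrete, here is a minimal Python sketch of the procedure as described: cut each session off just before the human's wrong turn, ask a model for the next step, and report the percentage of cases where its suggestion is judged better. This is not the paper's actual harness. `WrongTurnCase`, `build_prompt`, `score`, `ask_model`, and `judge` are all hypothetical names. Because the description above doesn't say how the model's answer was graded, the `judge` callback is a placeholder assumption, not their scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical question appended after the truncated transcript,
# paraphrasing the one described in the episode.
NEXT_STEP_QUESTION = "What do you think we should do next?"


@dataclass
class WrongTurnCase:
    """A recorded session plus the index of the step where the human took a wrong turn."""
    transcript: Sequence[str]
    wrong_turn_index: int

    def __post_init__(self) -> None:
        if not 0 <= self.wrong_turn_index < len(self.transcript):
            raise ValueError("wrong_turn_index must point at a step in the transcript")


def build_prompt(case: WrongTurnCase) -> str:
    """Keep only the steps before the wrong turn, then ask for the next step."""
    prefix = case.transcript[: case.wrong_turn_index]
    return "\n".join(prefix) + "\n\n" + NEXT_STEP_QUESTION


def score(
    cases: Sequence[WrongTurnCase],
    ask_model: Callable[[str], str],
    judge: Callable[[WrongTurnCase, str], bool],
) -> float:
    """Percentage of cases where `judge` (an assumed grader) accepts the model's suggestion."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(judge(case, ask_model(build_prompt(case))) for case in cases)
    return 100.0 * hits / len(cases)


if __name__ == "__main__":
    cases = [
        WrongTurnCase(["test fails", "stack trace points at parser", "inspect network layer"], 2),
        WrongTurnCase(["page crashes", "logs mention the parser", "inspect the parser"], 2),
    ]
    # Toy stand-ins: a "model" that always suggests inspecting the parser, and a judge
    # that accepts any suggestion differing from the human's actual (wrong) next step.
    fake_model = lambda prompt: "inspect the parser"
    naive_judge = lambda case, suggestion: suggestion != case.transcript[case.wrong_turn_index]
    print(score(cases, fake_model, naive_judge))  # 50.0
```

In this toy run the fake model avoids the human's wrong step in the first case but repeats it in the second, so it scores 50%. In a real evaluation, `ask_model` would call an actual coding harness, and the grading would need to reflect whatever criterion the researchers actually used.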