Azeem Azhar
So what you looked at in the report, one observation was that when people used the API, the system could do a 3.5-hour-long task.
In other words, a task a human might take 3.5 hours to do with a 50% success rate.
But if they were using Claude.ai, that 50% success threshold was a 17-hour task, which is two workdays in Europe and, of course, one workday in the Bay Area.
I'm just quite curious about a few details there, right?
So what is a 17-hour human task, and how much of that time were the humans having to spend on it?
Because it's really different if I just, you know, bark a command at Claude, "Who will rid me of this turbulent priest?", and it goes off and does it for 17 hours, versus if I have to sit over the machine and give it feedback every 30 minutes.
Because in that case, I'm context switching and perhaps I can't really do other things.
So is it a 17-hour task, or am I just pressing Shift-Tab, accept, accept on my keyboard for the better part of a day?
And what is it about these multi-turn conversations that a single shot can't achieve?
So I'm just curious.
So what is that task?
How much is a human sitting over the machine pressing a button?
It's quite a complicated picture, because only a certain type of person with a certain kind of experience can command a 17-hour or longer task.
Only someone with that experience can make sense of whether a literature review is good or not.
It's why PhD students have supervisors.
And so there is a tension in your findings that I think is worth exploring.
Models are improving and people are trusting them with longer and harder tasks.
But judging those outputs does require real expertise.
But there is this hollowing out because many of the junior tasks that have built that expertise are