Deep Questions with Cal Newport

Is AI About to “Eat Everything”? | AI Reality Check

14 May 2026

Transcription

Chapter 1: What is the main topic discussed in this episode?

0.031 - 15.533 Cal Newport

Last week, the AI safety and evaluation organization METR, that's M-E-T-R, released a new update to their famous AI time horizon chart. Look, I'm going to load it on the screen here for people who are watching.

Chapter 2: What does the METR chart measure?

16.274 - 46.501 Cal Newport

And when you zoom in, you can see these points on this chart starting around 2025 begin to go up. And then when we get to 2026, they go way up. And then with the latest update, they go way up again. Now, this graph looks scary. Even if you don't know what it means, it does create a strong sense of digital ick. And as you can imagine, the Internet jumped into action to try to amplify that uneasy feeling.

Chapter 3: What do these measurements actually capture?

47.142 - 65.821 Cal Newport

Now, in a recent essay posted to his newsletter, Gary Marcus did a good job of rounding up some of the more, shall we say, concerned responses to this latest update to METR's graph. Let me show you a couple here. Here's one. A tweet that said, AI power is doubling every 103 days now.

Chapter 4: How are the models getting better?

66.342 - 90.74 Cal Newport

It's going to eat everything. Nothing will be spared. We are on the threshold of truly ergodic alien intelligences in which human input will be nothing but a liability. All right. Here's another example that Gary pointed out. The tweet simply says, tick tock. It has an expertly drawn graph that shows highest intelligence on Earth by time.

91.361 - 104.333 Cal Newport

And you see there's a point where it goes up, up, crosses a tripwire, and then shoots straight up, where human brains become smart enough to create ASI, which is artificial superintelligence. Then below it is a version of that time horizon graph.

Chapter 5: Does this mean AI is about to 'eat everything'?

104.554 - 113.282 Cal Newport

And they're like, look, doesn't that look similar? The line goes up, the line goes up. So I guess we're about to have artificial superintelligence

113.262 - 135.186 Cal Newport

conquer the world now. Many more tweets are out there in response to this time horizon update. They all give you the same sense: that this METR chart is capturing an intelligence explosion that (a) we're not ready for, (b) will change everything, and (c) vindicates every bold or crazy thing anyone has ever claimed about AI and its capabilities. But is this right?

135.706 - 149.188 Cal Newport

Well, it's Thursday, which means it's time for an AI reality check episode of this show, which seems like a perfect time to look closer at what exactly the METR time horizon chart is showing and what exactly that means.

Chapter 6: What is the significance of the recent AI-related tweets?

150.229 - 179.694 Cal Newport

As always, I'm Cal Newport, and this is Deep Questions, the show for people seeking depth in a distracted world. All right, so the first question we want to ask here is, what is it exactly that the METR time horizon chart is actually showing? All right, so I spent time reading about it. The good news is METR actually is very transparent.

180.174 - 196.137 Cal Newport

They publish very detailed collections of notes describing their methodology and what goes into their chart. So it was actually quite a pleasure to get answers to these questions. So what are they actually showing on this chart? Well, here's what they did. They came up with a collection of what they call software tasks.

196.157 - 219.612 Cal Newport

These are well-defined challenges that you can solve by writing and/or analyzing computer code. All right? Then for each of these tasks, they went out and asked a collection of human programmers to go do the task. Hey, go do this. I believe the instruction was as quickly as you can. And then they asked them, how long did it take you to complete this task? They would then...

219.592 - 243.202 Cal Newport

Take the geometric mean, so they would average those answers, and whatever the mean was, whatever the average was, is the human time duration that they would label that task with. So if it, on average, took people two hours to complete a given task, they would say this is a two-hour task. They then said, let's evaluate different large language model-based tools on these tasks.
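As a rough illustration of the labeling step just described, here is a minimal Python sketch. The completion times are made-up numbers, and the function name `label_task_duration` is hypothetical; this is not METR's code, only the geometric-mean arithmetic the episode mentions. Note that a geometric mean is less sensitive to one very slow participant than an ordinary arithmetic average.

```python
from statistics import geometric_mean, mean


def label_task_duration(human_minutes):
    """Return the duration label for a task: the geometric mean
    of the completion times reported by the human programmers."""
    return geometric_mean(human_minutes)


# Hypothetical completion times (in minutes) for one task.
times = [60, 120, 240]

label = label_task_duration(times)
print(f"geometric mean:  {label:.1f} minutes")
print(f"arithmetic mean: {mean(times):.1f} minutes")
```

With these invented numbers the geometric mean comes out to two hours, while the arithmetic mean is pulled higher by the slowest participant.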

243.222 - 266.878 Cal Newport

Now, of course, a given large language model can't do anything except spit out tokens. So they would take each large language model and combine it with what they call a scaffold, but what we would also call today a coding harness. So a program that can call the LLM to try to solve programming challenges. This is like Claude Code or Cursor or Codex. These are all coding harnesses.

266.898 - 282.184 Cal Newport

So the coding harness, when you give it the problem, for example, will query an LLM and say, give me a plan for tackling this problem. The coding harness will then go step-by-step through that plan. It will call the LLM when it needs code generated, but the harness could also do a lot of stuff on its own, like do checks.

282.164 - 302.266 Cal Newport

It knows how to interact with various software tools that are relevant for creating software. It can go back and verify, hey, did this step really work? The coding harnesses have gotten pretty complicated. More on that later. So they'll take an LLM with a coding harness, and one by one, they'll ask it to do each of these tasks. And they actually have it do each of these tasks six times.

303.427 - 324.693 Cal Newport

And if it completes the task at least half the time, they say, okay, this model plus this harness can tackle that task. And they keep going until it gets stuck. So they say, what was the longest duration task that this LLM plus a coding harness was able to complete at least 50% of the time? And that's what they plotted. So let's go back now to this plot to make that a little bit more clear.
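The scoring rule as described in this segment can be sketched in a few lines of Python. This follows the simplified description given in the episode (six runs per task, pass at 50%, report the longest passed task) and is not METR's published implementation. The task names, durations, success counts, and the names `TaskResult`, `passes`, and `time_horizon` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    human_minutes: float  # duration label derived from human completion times
    successes: int        # runs in which the model plus harness succeeded
    attempts: int = 6     # the episode says each task is run six times


def passes(result, threshold=0.5):
    """A task counts as completed if the success rate meets the threshold."""
    return result.successes / result.attempts >= threshold


def time_horizon(results, threshold=0.5):
    """Longest human-labeled duration among the tasks the model passes.

    Returns 0.0 if the model passes none of the tasks.
    """
    passed = [r.human_minutes for r in results if passes(r, threshold)]
    return max(passed, default=0.0)


# Invented results for one model plus harness.
results = [
    TaskResult("short scripting task", 5, successes=6),
    TaskResult("fix bugs in a small Python library", 65, successes=4),
    TaskResult("exploit a buffer overflow", 125, successes=2),
]

print(f"time horizon: {time_horizon(results)} minutes")
```

In this invented example the model clears the 65-minute task (4 of 6 runs) but not the 125-minute one (2 of 6), so the point plotted for it would sit at 65 minutes.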

324.713 - 345.68 Cal Newport

This is a little bit confusing sometimes. So we see on the y-axis here the duration that various tasks are labeled with, right? So fix bugs in a small Python library was labeled a little bit more than an hour. That's how long it took the humans who did it, who completed the task to actually finish it. Exploit a buffer overflow, that took a little bit more than two hours, etc.

Chapter 7: What are the implications of the METR chart for AI development?

631.07 - 644.522 Cal Newport

Maybe they're having to learn a new programming language or go master. They've never done something like this before, and they're on the Internet for six hours trying to learn what it is. We don't know what this time is actually being spent on. And this is what METR says. I got a quote from their study here.

645.243 - 665.886 Cal Newport

The time horizon is closer to what a, quote, low-context person, such as a new hire or remote internet contractor, can accomplish. An eight-hour time horizon does not mean that AIs can do eight hours of work that a high-context human professional can do as part of their day-to-day job. So they're saying, look, we don't really know what to tell you about these numbers precisely.

665.907 - 688.553 Cal Newport

It's just that some people, when we gave them this task, it took them a while. So probably the right way to think about what is being measured here is something like a general benchmark for programming capability. And these time durations, you should not get caught up with the particular hours, but just think of these as an abstract measure of difficulty.

688.573 - 708.902 Cal Newport

So I don't know what these low context human programmers were doing, but if this task took them twice as long as this task, then maybe that's a twice as hard task. So we don't know exactly what this abstract scale tells us, but it's like a good general way of capturing the hardness of programming tasks, which is smart, right? It's a good way to do a benchmark.

708.923 - 729.88 Cal Newport

We took a bunch of programming problems. We found some way to measure how hard they are. And what we're asking is how far can this new model get in these tasks? How hard of a task can this new model actually complete? And so if we see this model can complete a harder task than this model, we're like, oh, that model is better at programming tasks.

729.9 - 754.087 Cal Newport

That's the right way probably to think about what METR is doing. I think it's a well-designed benchmark of how capable models combined with harnesses are at various programming tasks. And that's the right way to read those numbers, more like abstract difficulty numbers. The actual times aren't that meaningful. All right, third question. How are these models getting better?

754.708 - 771.112 Cal Newport

Why are we seeing these jumps? Well, I want to jump back into the chart here because the timing here is really important and it connects to some important things you need to know about how LLM-based coding models work and what has happened in the last two years. So if we look at this chart, I'll bring it back up here.

771.172 - 791.901 Cal Newport

Notice it's flat for a long time from GPT-2 all the way to this point right here. Basically, they can't do anything, right? They can't complete any meaningful, interesting task that was part of this coding suite. Now, there's a reason for this, because up until this first point right here, which is like Claude Sonnet 3.5, and we get o1-preview.

791.921 - 811.347 Cal Newport

This is where we first start to see a rise, you know, hey, there are some of these tasks we can complete. Up to that point, what was happening with LLMs? The focus was on pre-training. Pre-training is the long, expensive period where you use real text written by humans and you have the model try to guess missing tokens.
