Nathaniel Whittemore
Now, you might remember about a week or a week and a half ago, we had a leaked blog post talking about this new model that represented a step change in capability that was in fact so powerful that it had pretty serious cybersecurity implications and would not be released to the public, at least not in the normal way.
That model, Mythos, was confirmed at the time by Anthropic, but without a lot of detail.
But now that detail has come.
We got an announcement about Project Glasswing, which is their way of soft-launching it with a very select number of partners with an eye to hardening it from a cybersecurity perspective, an extensive cybersecurity capability review from Anthropic's red team, and even a 244-page system card.
And before we get into all the reactions, I do want to talk about the benchmark results that they are reporting.
Jian, formerly of Replit, now with Anthropic, writes: Claude Mythos is arguably the biggest step change in AI capabilities since the GPT-4 jump.
I don't think I was ready for a world where the hardest possible agentic coding evals were going to get solved so quickly.
When Mythos is allowed to think longer, act deeper, and better explore the solution space, it passes 92% of Terminal Bench task attempts.
But let's take a step back and compare this to Opus 4.6.
On SWE-Bench Pro, Opus 4.6 scored 53.4%. Mythos Preview, meanwhile, got 77.8%.
On Terminal Bench 2.0, Opus scored 65.4%, while Mythos scored 82%.
On SWE-Bench Verified, the jump between Opus and Mythos is from 80.8% to 93.9%.
Now as you just heard, part of what makes the Terminal Bench result interesting is that Anthropic actually ran into limitations with the testing harness itself.
Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to 4 hours, and under those conditions, Mythos scored not an 82%, but a 92.1%.
While the jump on coding benchmarks was the most profound and the most reported, there were also huge improvements on various knowledge-based benchmarks as well.
For science knowledge, Mythos scored 94.5% on the GPQA Diamond compared to 91.3% for Opus.
On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos Preview's 56.8%.
With tools enabled, performance jumped to 64.7% compared to 53.1% for Opus.
On OSWorld, which measures agentic computer use, Opus 4.6 got a 72.7%, which jumped to 79.6% for Mythos.