Nathaniel Whittemore

Speaker
14492 total appearances

Podcast Appearances

The AI Daily Brief: Artificial Intelligence News and Analysis
Should We Be Scared of Anthropic's Mythos?

Now, you might remember about a week or a week and a half ago, we had a leaked blog post talking about this new model that represented a step change in capability that was in fact so powerful that it had pretty serious cybersecurity implications and would not be released to the public, at least not in the normal way.

That model, Mythos, was confirmed at the time by Anthropic, but without a lot of detail.

But now that detail has come.

We got an announcement about Project Glasswing, which is their way of soft testing it with a very select number of partners, with an eye to hardening it from a cybersecurity perspective; an extensive cybersecurity capability review from Anthropic's Red Team; and even a 244-page system card.

And before we get into all the reactions, I do want to talk about the benchmark results that they are reporting.

Jian, formerly of Replit, now with Anthropic, writes, Claude Mythos is arguably the biggest step change in AI capabilities since the GPT-4 jump.

I don't think I was ready for a world where the hardest possible agentic coding evals were going to get solved so quickly.

When Mythos is allowed to think longer, act deeper, and better explore the solution space, it passes 92% of Terminal Bench task attempts.

But let's take a step back and compare this to Opus 4.6.

On SWE-bench Pro, Opus 4.6 scored 53.4%.

Mythos Preview, meanwhile, got 77.8%.

On Terminal Bench 2.0, Opus had a 65.4%, while Mythos has an 82%.

On SWE-bench Verified, the jump from Opus to Mythos is from 80.8% to 93.9%.

Now, as you just heard, part of what makes the Terminal Bench result interesting is that Anthropic actually ran into limitations with the testing harness itself.

Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to four hours, and under those conditions, Mythos scored not an 82%, but a 92.1%.

While the jump on coding benchmarks was the most profound and the most reported, there were also huge improvements on various knowledge-based benchmarks.

For science knowledge, Mythos scored 94.5% on GPQA Diamond compared to 91.3% for Opus.

On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos Preview's 56.8%.

With tools enabled, Mythos's score rose to 64.7%, compared to 53.1% for Opus.

On OSWorld, which measures agentic computer use, Opus 4.6 got a 72.7%, which jumped to 79.6% for Mythos.
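
To put the quoted numbers side by side, here is a rough Python sketch that tabulates the scores read out above and works out the percentage-point gain on each benchmark. It assumes every "Opus" figure refers to Opus 4.6 and every "Mythos" figure to Mythos Preview on the default settings; the `error_reduction` helper is a hypothetical extra view (how much of Opus's remaining failure rate disappears), not a metric reported in the episode.

```python
# Scores in percent, as quoted in the transcript: (Opus 4.6, Mythos Preview).
# Assumption: unqualified "Opus" and "Mythos" mean these two models.
scores = {
    "SWE-bench Pro": (53.4, 77.8),
    "Terminal Bench 2.0": (65.4, 82.0),
    "SWE-bench Verified": (80.8, 93.9),
    "GPQA Diamond": (91.3, 94.5),
    "Humanity's Last Exam (no tools)": (40.0, 56.8),
    "Humanity's Last Exam (with tools)": (53.1, 64.7),
    "OSWorld": (72.7, 79.6),
}


def point_gain(opus: float, mythos: float) -> float:
    """Absolute improvement in percentage points."""
    return round(mythos - opus, 1)


def error_reduction(opus: float, mythos: float) -> float:
    """Hypothetical helper: percent of Opus's failures that Mythos removes."""
    return round(100 * (mythos - opus) / (100 - opus), 1)


# Percentage-point gain per benchmark.
gains = {name: point_gain(*pair) for name, pair in scores.items()}

for name, (opus, mythos) in scores.items():
    print(
        f"{name}: {opus}% -> {mythos}% "
        f"(+{gains[name]} points, {error_reduction(opus, mythos)}% fewer failures)"
    )
```

Point gains and failure-rate reductions can rank the benchmarks differently: a benchmark that was already near its ceiling can show a small point gain while still removing a large share of the remaining failures.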