Nathaniel Whittemore
Now, you might remember about a week or a week and a half ago, we had a leaked blog post talking about this new model that represented a step change in capability that was in fact so powerful that it had pretty serious cybersecurity implications and would not be released to the public, at least not in the normal way.
That model, Mythos, was confirmed at the time by Anthropic, but without a lot of detail.
But now that detail has come.
We got an announcement about Project Glasswing, which is their way of soft-launching it with a very select number of partners with an eye to hardening it from a cybersecurity perspective, an extensive cybersecurity capability review from Anthropic's red team, and even a 244-page system card.
And before we get into all the reactions, I do want to talk about the benchmark results that they are reporting.
Jian, formerly of Replit, now with Anthropic, writes: Claude Mythos is arguably the biggest step change in AI capabilities since the GPT-4 jump.
I don't think I was ready for a world where the hardest possible agentic coding evals were going to get solved so quickly.
When Mythos is allowed to think longer, act deeper, and better explore the solution space, it passes 92% of Terminal Bench task attempts.
But let's take a step back and compare this to Opus 4.6.
On SWE-Bench Pro, Opus 4.6 scored 53.4%. Mythos Preview, meanwhile, got 77.8%.
On Terminal Bench 2.0, Opus scored 65.4%, while Mythos scored 82%.
On SWE-Bench Verified, the jump between Opus and Mythos is from 80.8% to 93.9%.
Now as you just heard, part of what makes the Terminal Bench result interesting is that Anthropic actually ran into limitations with the testing harness itself.
Anthropic ran the benchmark again using improvements from Terminal Bench 2.1 and extending the timeout window to 4 hours, and under those conditions, Mythos scored not an 82%, but a 92.1%.
While the jump on coding benchmarks was the most profound and the most reported, there were also huge improvements on various knowledge-based benchmarks as well.
For science knowledge, Mythos scored 94.5% on the GPQA Diamond compared to 91.3% for Opus.
On Humanity's Last Exam, Opus got a 40% on a no-tools run compared to Mythos Preview's 56.8%.
With tools enabled, performance jumped to 64.7% compared to 53.1% for Opus.
On OSWorld, which measures agentic computer use, Opus 4.6 got a 72.7%, which jumped to 79.6% for Mythos.