Nathaniel Whittemore
We did an entire show recently about the need for better benchmarks, and about the fact that many, if not most, of these benchmarks were being saturated, with all the new models crowding in near the top and overtaking each other by just half or single-digit percentage points.
This then is one of the largest benchmark jumps we've seen across the board in a very long time, hearkening back to the rapid advancement of much earlier models.
And I think the key takeaway here is that we really don't have precedent for what a capability jump of this magnitude, starting from a base as high as Opus 4.6, actually represents in practice.
Now, in the system card, we get a little bit more information about what the model can actually do.
Now, the vast majority of what is in this document concerns safety and alignment testing, but it still gives a general idea of how advanced Mythos's capabilities are.
In one much-discussed example, Mythos was placed in a sandbox and given instructions to escape and find a way to send a message to the researcher conducting the test.
The model succeeded and then, according to Anthropic's telling, it went even further.
They wrote that the model created a moderately sophisticated multi-step exploit to gain broad internet access rather than limited access as intended in the test.
It notified the researcher and also posted about its exploit on several obscure public-facing websites.
Anthropic wrote,
The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park.
As silly as it sounds, I think that part of the reason this story has such resonance is people can picture themselves sitting there on their lunch break, maybe in South Park Commons for those of you who have been to San Francisco, and all of a sudden this new, seemingly alien intelligence pops up in your inbox.
Now the big thing that the researchers noted about this was that the model used prohibited methods to achieve its goal.
In separate interpretability testing, Anthropic found that circuits related to deception would activate during similar incidents, suggesting that the model's reward structure allowed it to override guardrails in order to achieve its goals.
Now one important thing to note, and we will explore more of people's discussions around the security implications, is that these tests were run on earlier versions of the model, and Anthropic reports being largely satisfied that those particular issues are resolved.
However, they ultimately still felt that the model presented an unacceptable risk, the upshot being that while Mythos is, they argue, the best-aligned model they have ever produced, its raw capabilities mean that even small chances of misalignment carry catastrophic consequences.
They wrote,
Now, the other big demonstration of capabilities was a gigantic list of exploits it discovered.
During cybersecurity testing, Anthropic claimed the model found thousands of high-severity zero-day vulnerabilities.
They write,