Nathaniel Whittemore
They claim that GLM 5.1 spent eight hours autonomously building a Linux desktop using a self-review loop to remove the need for human intervention.
And this is kind of what they emphasized in their announcement post as well, calling the blog post "GLM 5.1: Towards Long Horizon Tasks."
Running vector DB tests, the model was capable of carrying out the database optimization test with significant results.
The model carried out over 600 iterations using more than 6,000 tool calls to deliver 6x the performance of a standard 50-turn session.
Z.ai leader Lou wrote on X, "Agents could do about 20 steps by the end of last year.
GLM 5.1 can do 1,700 right now.
Autonomous work time may be the most important curve after scaling laws.
GLM 5.1 will be the first point on that curve that the open source community can verify with their own hands."
Now, of course, whenever a company reports their own benchmarks, it's always worth taking them with a grain of salt and waiting to see what the actual vibes are around the model as people get their hands on it.
But at least at first glance, the model looks like a big step up for Chinese AI.
It was trained entirely on less powerful Huawei chips, again demonstrating that the Chinese hardware stack can produce some powerful results.
Also, coming just two months after the release of Opus 4.6 and GPT 5.4, it suggests the US continues to be only months ahead of their Chinese rivals.
Leet LLM summed up the gap in the conversation on X, saying, "Everyone's freaking out about Claude Mythos while Z.ai casually open-sourced a model built for 8-hour autonomous execution."
Now, speaking of Claude and Anthropic, if you thought they were going to slow down for the sake of discussion around Mythos, think again.
On Wednesday afternoon, the company announced Claude Managed Agents, which they're pitching as everything you need to build and deploy agents at scale.
In their announcement tweet, which has been seen 16 million times, they write that Claude Managed Agents pairs an agent harness tuned for performance with production infrastructure, so you can go from prototype to launch in days.
It seems like part of the goal with this is to close the capability gap that we've been following on the show as well.
Anthropic's head of product for the Claude platform, Angela Jiang, argued to Wired that there is a, quote, notable gap between what Anthropic's models are capable of and what businesses are using them for.
This tool is meant to close that gap.
Here's how Wired describes it, which is actually one of the simpler explanations that I saw.