Nathaniel Whittemore

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

As I mentioned before, a lot of the chatter is around the sandbox breakout and the emailing of the researcher while they were eating in the park.

1269.2 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

And one less theoretical safety concern that came up was a note fairly deep in the security card where Anthropic admitted to having accidentally trained against the chain of thought for Opus 4-6, Sonnet 4-6, and Mythos for 8% of reinforcement learning.

1275.47 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

So let me be clear, at a minimum, any and all reassuring evidence from their chain of thoughts is worthless.

1287.368 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

They are hopelessly corrupted.

1291.795 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

When Dan McAdory writes, Zvi, can you explain why this is bad?

1293.297 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Matt Hurd summed up, If the chain of thought is trained against, selective pressure will teach it to hide unwanted behavior from appearing in its chain of thought, making its chain of thought less faithful to its internal process, making chain of thought observation less effective at detecting unwanted behavior.

1296.364 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Zvi actually wrote about this extensively back in March of 2025 in a blog post called The Most Forbidden Technique.

1309.514 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

He writes, The most forbidden technique is training an AI using interpretability techniques.

1315.561 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

An AI produces a final output X via some method M. You can analyze M using technique T to learn what the AI is up to.

1319.746 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

You could train on that.

1327.135 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Never do that.

1328.476 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

You train on X.

1329.457 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

i.e.

1330.979 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

the final output.

1331.32 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Only X. Never M, never T. Why?

1332.462 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Because T is how you figure out when the model is misbehaving.

1335.408 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

If you train on T, you are training the AI to obfuscate its thinking and defeat T. You will rapidly lose your ability to know what is going on in exactly the ways you most need to know what's going on.

1339.315 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

Another thing that Anthropic team members discussed was the exhibited internal behavior of Cloud Mythos.

1348.934 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

For example, Jack Lindsay writes, early versions of Mythos preview often exhibited over-eager and or destructive actions.

1354.182 View full episode →

The AI Daily Brief: Artificial Intelligence News and Analysis

Should We Be Scared of Anthropic's Mythos?

The model bulldozing through obstacles to complete a task in a way the user wouldn't want.

1360.29 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment