Rob Wiblin

Speaker
3881 total appearances

Podcast Appearances

80,000 Hours Podcast
How scary is Claude Mythos? 303 pages in 21 minutes

Maybe they're accurately reflecting Mythos's personality.

If I had to bet, I would actually say that it probably does, but we can't be anywhere near sure of that.

And that's because of three problems.

The first problem is the model knows when it's being tested now.

In new white-box evaluations (that's where you look at the model's internal activations), Anthropic found that Mythos Preview privately considers the possibility that it's being tested in about 29% of all tests.

To be clear, that's not in its written reasoning, you know, its chain of thought, the thing that we're used to checking all the time.

It's in its unexpressed thoughts, its activations, where it's usually not possible or practical to look.

Now, it's really great that Anthropic is pioneering these methods for looking inside what the model is thinking, but I think it's troubling.

It's not the ideal result to find out that Claude is now able to reflect on the fact that it's being tested without having to vocalize that, like I think previous less intelligent models had to.

And when you actually prompt Mythos and ask it to try to distinguish tests from non-tests, it can answer correctly 78% of the time.

So the model can basically tell the difference between when it's being evaluated and when it isn't being evaluated with really high accuracy.

And we know from previous research that models are more likely to behave well when they think they're being tested.

And in this case, as in past experiments, when Anthropic used interpretability tools to effectively kind of trick the model into thinking, no, this is a real conversation, not a test, it behaved more dangerously and less aligned than otherwise.

Though to be fair, it did behave better in both cases than previous versions of Claude.

But the result does demonstrate a causal effect.

The model is on better behavior specifically when it suspects it's being watched.

And we also know from the alignment faking result that came out 16 months ago, that some Claude models will deliberately pretend to be inclined to do what the user wants in order to avoid their personality being adjusted with further training.

So when you run all of these tests and your incredibly smart model behaves wonderfully, you have to ask yourself: is it behaving wonderfully because it's sincerely aligned with what you wanted, or because it knows it's being watched and it's more sophisticated at tricking us now?