Rob Wiblin

It's not the ideal result to find out that Claude is now able to reflect on the fact that it's being tested without having to vocalize that, as I think previous, less intelligent models had to.

624.941

80,000 Hours Podcast

How scary is Claude Mythos? 303 pages in 21 minutes

And when you actually prompt Mythos and ask it to try to distinguish tests from non-tests,

634.794

it can answer correctly 78% of the time.

640.522

So the model can basically tell the difference between when it's being evaluated and when it isn't being evaluated with really high accuracy.

643.126

And we know from previous research that models are more likely to behave well when they think they're being tested.

650.596

And in this case, as in past experiments, when Anthropic used interpretability tools to effectively trick the model into thinking, "no, this is a real conversation, not a test," it behaved more dangerously and in a less aligned way than otherwise.

656.404

Though to be fair, it did behave better in both cases than previous versions of Claude.

668.18

But the result does demonstrate a causal effect.

673.567

The model is on better behavior specifically when it suspects it's being watched.

676.251

And we also know from the alignment faking result that came out 16 months ago that some Claude models will deliberately pretend to be inclined to do what the user wants in order to avoid their personality being adjusted with further training.

680.797

So when you run all of these tests and your incredibly smart model behaves wonderfully, you have to ask yourself: is it behaving wonderfully because it's sincerely aligned with what you wanted, or because it knows it's being watched and it's more sophisticated at tricking us now?

694.015
