Nicholas Andresen

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Researchers call this property monitorability.

"How AI Is Learning to Think in Secret" by Nicholas Andresen

And unlike faithfulness, monitorability is something we can actually test directly by setting up situations where models behave badly and checking whether we can catch them.

1060.146 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Either we can or we can't.

1069.975 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Baker et al., 2025, did exactly this.

1072.297 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

They studied a frontier reasoning model doing coding tasks.

1076.437 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

The model was rewarded for completing the tasks.

1080.607 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

But sometimes it would cheat, finding clever ways to make a task appear complete without actually doing the work.

1083.835 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

You know, the classics shove everything in a closet before the guests arrive.

1090.35 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

When the researchers tried to catch the model's most widespread cheating tricks by looking only at the code it wrote, they caught about 60%.

1094.898 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

When they also read the chain of thought, they caught 95%.

1102.871 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Why the gap?

1107.418 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Because the models often announce their intentions.

1109.081 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

They'd helpfully write things like let's hack right there in the chain of thought and then go do it.

1112.406 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Apparently nobody told them about the Fifth Amendment.

1117.73 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

This is good news, for now.

1121.435 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

But there's no guarantee models keep announcing their intention so clearly.

1124.059 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

The question is whether the pressures of training and deployment will preserve this property or erode it.

1128.626 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

The pressures toward erosion start with the most obvious idea.

1134.114 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Use the chain of thought to train the model.

1137.979 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Watch what the model thinks.

1140.683 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment