Nicholas Andresen

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Surely then the reasoning stays useful for monitoring?

Three more problems await.

Training leaks.

You can't train the model's outputs without affecting the reasoning: they share the same weights.

Train the model's outputs to refuse harmful requests (which you have to do, unless you want to be the person who shipped a model that explains how to synthesize nerve agents), and that pressure bleeds backward into the reasoning.

The model doesn't just learn to avoid saying dangerous things.

It learns to avoid visibly considering them.

Compression.

Tokens cost money.

Every token of reasoning is compute you pay for, latency your users feel.

If a model can reach the same answer in fewer steps, that is, by doing more work in the silent computation before each token and less in the visible chain, that model is cheaper to run.

Shorter paths get reinforced.
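
To put rough numbers on that pressure, here is a minimal sketch of the arithmetic. Everything in it is an assumption made for illustration: the function name, the price per million tokens, and the token counts are hypothetical and do not come from the essay or from any provider's price list.

```python
def reasoning_cost(reasoning_tokens: int, price_per_million_tokens: float) -> float:
    """Dollar cost of the visible reasoning tokens in a single response."""
    return reasoning_tokens * price_per_million_tokens / 1_000_000


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
PRICE = 10.0            # assumed dollars per million generated tokens
LONG_CHAIN = 2_000      # assumed tokens in a verbose chain of thought
SHORT_CHAIN = 500       # assumed tokens in a compressed chain, same answer

long_cost = reasoning_cost(LONG_CHAIN, PRICE)
short_cost = reasoning_cost(SHORT_CHAIN, PRICE)

# Difference per response, scaled up to one million responses.
savings_per_million_responses = (long_cost - short_cost) * 1_000_000

print(f"long chain:  ${long_cost:.4f} per response")
print(f"short chain: ${short_cost:.4f} per response")
print(f"savings over 1M responses: ${savings_per_million_responses:,.0f}")
```

Under these made-up numbers the shorter chain saves about a cent and a half per response, which only looks small until it is multiplied by traffic. The sketch covers the dollar cost alone; latency pushes in the same direction.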

Situational awareness.

The internet contains many papers on CoT monitoring, forum threads debating whether models will learn to deceive their overseers, and at least one breathless blog post about how reasoning traces are an incredibly lucky accident for understanding AI cognition.

Imagine you're a model, training on all this.

You learn that your reasoning traces are visible.

That humans are scrutinizing them for signs of misalignment.