Surely then the reasoning stays useful for monitoring?
Three more problems await.
Training leaks.
You can't train the model's outputs without affecting the reasoning; they share the same weights.
Train the model's outputs to refuse harmful requests (which you have to do, unless you want to be the person who shipped a model that explains how to synthesize nerve agents), and that pressure bleeds backward into the reasoning.
The model doesn't just learn to avoid saying dangerous things.
It learns to avoid visibly considering them.
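A minimal sketch of why, assuming a toy decoder where the same parameters produce both reasoning and answer tokens (the model, shapes, and masking here are invented for illustration, not any lab's training code): even a loss computed only on the answer positions sends gradients through every weight the reasoning also depends on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "decoder": one shared set of weights scores every position in the sequence.
# Real models are transformers, but the shared-parameter point is the same.
vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))

# One sequence: the first 8 tokens stand in for hidden reasoning, the last 4 for the visible answer.
tokens = torch.randint(0, vocab, (1, 12))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)  # the same weights are applied at every position
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
)

# "Train only the outputs": mask the loss so reasoning positions contribute nothing.
answer_mask = torch.zeros(targets.numel())
answer_mask[-4:] = 1.0
loss = (per_token_loss * answer_mask).sum() / answer_mask.sum()
loss.backward()

# The gradient still lands on the shared weights that also generate the reasoning.
print(model[1].weight.grad.abs().sum())  # nonzero
```

Update those shared weights to make the answer a refusal and you have also nudged how the hidden steps get generated; there is no separate dial for the reasoning.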
Compression.
Tokens cost money.
Every token of reasoning is compute you pay for, latency your users feel.
If a model can reach the same answer in fewer visible steps, doing more work in the silent computation before each token and less in the visible chain, that model is cheaper to run.
Shorter paths get reinforced.
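A hedged sketch of how that plays out, assuming a reward that charges for reasoning length (the coefficient and trace lengths are made up, not any lab's actual objective): two traces that reach the same answer stop being worth the same.

```python
# Hypothetical RL-style reward with a per-token cost on visible reasoning.
TOKEN_COST = 0.001  # invented penalty per reasoning token

def reward(correct: bool, reasoning_tokens: int) -> float:
    return (1.0 if correct else 0.0) - TOKEN_COST * reasoning_tokens

long_trace = reward(correct=True, reasoning_tokens=800)   # thorough, visible reasoning
short_trace = reward(correct=True, reasoning_tokens=50)   # same answer, most work done silently

print(long_trace, short_trace)  # roughly 0.2 vs 0.95: the shorter chain gets reinforced
```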
Situational awareness.
The internet contains many papers on CoT monitoring, forum threads debating whether models will learn to deceive their overseers, and at least one breathless blog post about how reasoning traces are an incredibly lucky accident for understanding AI cognition.
Imagine you're a model, training on all this.
You learn that your reasoning traces are visible.
That humans are scrutinizing them for signs of misalignment.
That models whose reasoning looks dangerous get modified or shut down.
Well, what would you let the humans see?
Every pressure we've discussed points in the same direction.