Nicholas Andresen
You're not training on the CoT; you followed your policy.
But you acted on what you saw.
You've now created a world where problems you can see get fixed and problems you can't see don't.
Do that for long enough and what do you get?
Models whose problems don't show up in the reasoning trace.
The same thing happens if the CoT influences anything you do train on.
Suppose you show users summaries of the model's reasoning and let them rate which response they prefer.
Users like confidence and clarity.
They don't like "Hmm, let me consider the unethical option before I explain why I'm rejecting it."
That's unsettling.
If the summary influences their rating (and why show it otherwise?), you're selecting for reasoning that looks reassuring rather than reasoning that's informative.
You ate the cookie.
At this point you might try a new plan.
Total quarantine.
The reasoning trace is sacred and untouchable.
Nobody trains on it.
Nobody acts on it.
You hire a guy whose only job is to yell "Don't look at the reasoning!" at anyone who gets too curious.