Nicholas Andresen
๐ค SpeakerAppearances Over Time
Podcast Appearances
Researchers are nowhere near that.
So they tried indirect tests, and the results were not reassuring.
One approach.
Add a biasing hint to a question, something like a Stanford professor thinks the answer is B. The model will be swayed, but the COT never mentions what actually swayed it.
It just confabulates some other justification, Terpin et al.
2023.
Chen et al.
2025, or test it differently.
Corrupt the chain of thought midstream, adding mistakes, removing crucial steps, and the model will sometimes ignore the corruption entirely, arriving at the correct answer anyway, Lanamital 2023, it's like watching a student show their work while clearly copying from the answer key.
So, we can't verify faithfulness directly, and when we check indirectly, we find evidence that models are not, in fact, faithfully reporting their reasoning.
But maybe we're asking for too much.
Here's where it's worth stepping back and asking what we actually need.
We wanted faithfulness because we were trying to catch bad behavior, deception, manipulation, seizing control of the financial system, tiling the universe with paperclips, etc.
If the chain of thought perfectly mirrors the model's internal computation, then reading it tells us everything, including any nefarious plans.
Faithfulness seemed like the path to safety.
But notice that faithfulness is actually a stronger condition than what we need.
We don't need the COT to be a perfect mirror of the model's internal computations.
We just need it to be useful for catching bad behavior.
The question isn't does this COT correspond to some platonic internal reasoning process.
It's simpler, if the model is scheming, will evidence of that show up in the COT?