So fix it, right?
Can't we just force the models to use proper English?
You could try.
Penalize Thinkish during training.
Call the grammar police.
Have a little paperclip pop up.
People are trying this.
But it's not enough.
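For concreteness, here's a rough sketch of what "penalize Thinkish during training" could look like as reward shaping. Everything in it, the regex notion of "proper English," the weight, the function names, is an illustrative assumption, not anyone's actual training recipe.

```python
import re

# Crude stand-in for "proper English": tokens made of ordinary letters.
ENGLISH_TOKEN = re.compile(r"^[A-Za-z][A-Za-z'-]*$")

def thinkish_penalty(cot_tokens, weight=0.1):
    """Hypothetical reward-shaping term: the larger the share of
    chain-of-thought tokens that don't look like plain English words,
    the larger the penalty."""
    if not cot_tokens:
        return 0.0
    non_english = sum(1 for tok in cot_tokens if not ENGLISH_TOKEN.match(tok.strip()))
    return weight * non_english / len(cot_tokens)

def shaped_reward(task_reward, cot_tokens):
    # Final reward: did the model answer correctly, minus a tax on Thinkish.
    return task_reward - thinkish_penalty(cot_tokens)

# Example: a partly Thinkish chain of thought loses part of its reward.
print(shaped_reward(1.0, ["marinade", "the", "ratio", "overhang", "q~", "->>", "lattice", "%%"]))
```

Notice what a penalty like this actually optimizes: how the words look, not whether they track what the model is really doing.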
Thinkish threatens our ability to read the reasoning.
Even if we solved that, kept the chain of thought in pristine, human-readable English forever, every token a word your grandmother would recognize, there's a deeper problem.
Even if we can read it, is it true?
Researchers call the problem faithfulness.
Does the chain of thought genuinely reflect the computation that produced the answer?
Or is it theatre: the model reaching its conclusion through some inscrutable internal process, then generating a plausible-sounding story after the fact?
We can't actually tell directly.
To check, you'd need to compare the chain of thought against whatever the model is actually computing internally.
But whatever the model is actually computing internally is billions of matrix multiplications.
Interpretability researchers have made real progress on understanding pieces of the computation.
They can tell you that this neuron activates for the Golden Gate Bridge, and that if you amplify it, you get a model obsessed with the bridge, inserting it into every conversation, even ones about the Rwandan genocide.
But knowing that one neuron loves a bridge is very different from extracting "here's the actual chain of reasoning that led to this output."
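To make the "amplify it" move concrete, here's a toy, self-contained sketch of activation steering with a PyTorch forward hook. The tiny linear block and the random "bridge" direction are stand-ins I made up; the real work located a learned feature inside a production model, so read this as the shape of the idea, not the actual experiment.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block (illustrative only).
d_model = 16
block = nn.Linear(d_model, d_model)

# Hypothetical "bridge" feature direction. In practice this comes from
# interpretability tooling (a learned feature), not from randn.
bridge_direction = torch.randn(d_model)
bridge_direction = bridge_direction / bridge_direction.norm()

def amplify_feature(module, inputs, output, scale=8.0):
    # Add a scaled copy of the feature direction to the block's output:
    # the "turn this feature up" move described above.
    return output + scale * bridge_direction

handle = block.register_forward_hook(amplify_feature)

x = torch.randn(1, d_model)
steered = block(x)   # every forward pass now leans toward the "bridge" feature
handle.remove()      # detach the hook to restore normal behavior
```

Locating and nudging one direction like that is real, piecewise progress. It still doesn't hand you the chain of reasoning behind any particular answer.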