Nicholas Andresen

Even if we sold that, kept chain of thought in pristine human-readable English forever, every token a word your grandmother would recognize, there's a deeper problem.

880.936 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Even if we can read it, is it true?

890.331 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Researchers call the problem faithfulness.

893.022 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Does the COT genuinely reflect the computation that produced the answer?

895.926 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Or is it theatre, the model reaching its conclusion through some inscrutable internal process, then generating a plausible sounding story after the fact?

900.512 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

We can't actually tell directly.

909.204 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

To check, you'd need to compare the COT against whatever the model is actually computing internally.

911.827 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

but whatever the model is actually computing internally is billions of matrix multiplications.

917.423 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Interpretability researchers have made real progress on understanding pieces of the computation.

923.531 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

They can tell you that this neuron activates for Golden Gate Bridge, and if you amplify it, you get a model obsessed with the bridge, inserting it into every conversation, even ones about the Rwandan genocide.

929.298 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

But knowing that one neuron loves a bridge is very different from extracting here's the actual chain of reasoning that led to this output.

941.2 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment