Nicholas Andresen

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Researchers are nowhere near that.

948.574 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

So they tried indirect tests, and the results were not reassuring.

951.259 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

One approach.

955.797 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Add a biasing hint to a question, something like a Stanford professor thinks the answer is B. The model will be swayed, but the COT never mentions what actually swayed it.

957.38 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

It just confabulates some other justification, Terpin et al.

967.583 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

2023.

970.729 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Chen et al.

972.533 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

2025, or test it differently.

973.054 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Corrupt the chain of thought midstream, adding mistakes, removing crucial steps, and the model will sometimes ignore the corruption entirely, arriving at the correct answer anyway, Lanamital 2023, it's like watching a student show their work while clearly copying from the answer key.

976.081 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

So, we can't verify faithfulness directly, and when we check indirectly, we find evidence that models are not, in fact, faithfully reporting their reasoning.

992.48 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

But maybe we're asking for too much.

1003.092 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Here's where it's worth stepping back and asking what we actually need.

1005.815 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

We wanted faithfulness because we were trying to catch bad behavior, deception, manipulation, seizing control of the financial system, tiling the universe with paperclips, etc.

1010.346 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

If the chain of thought perfectly mirrors the model's internal computation, then reading it tells us everything, including any nefarious plans.

1020.952 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Faithfulness seemed like the path to safety.

1029.707 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

But notice that faithfulness is actually a stronger condition than what we need.

1032.692 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

We don't need the COT to be a perfect mirror of the model's internal computations.

1037.461 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

We just need it to be useful for catching bad behavior.

1042.63 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

The question isn't does this COT correspond to some platonic internal reasoning process.

1046.213 View full episode →

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

It's simpler, if the model is scheming, will evidence of that show up in the COT?

1051.838 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment