Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Nicholas Andresen

๐Ÿ‘ค Speaker
498 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Researchers are nowhere near that.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

So they tried indirect tests, and the results were not reassuring.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

One approach.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Add a biasing hint to a question, something like a Stanford professor thinks the answer is B. The model will be swayed, but the COT never mentions what actually swayed it.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

It just confabulates some other justification, Terpin et al.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

2023.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Chen et al.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

2025, or test it differently.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Corrupt the chain of thought midstream, adding mistakes, removing crucial steps, and the model will sometimes ignore the corruption entirely, arriving at the correct answer anyway, Lanamital 2023, it's like watching a student show their work while clearly copying from the answer key.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

So, we can't verify faithfulness directly, and when we check indirectly, we find evidence that models are not, in fact, faithfully reporting their reasoning.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

But maybe we're asking for too much.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Here's where it's worth stepping back and asking what we actually need.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

We wanted faithfulness because we were trying to catch bad behavior, deception, manipulation, seizing control of the financial system, tiling the universe with paperclips, etc.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

If the chain of thought perfectly mirrors the model's internal computation, then reading it tells us everything, including any nefarious plans.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Faithfulness seemed like the path to safety.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

But notice that faithfulness is actually a stronger condition than what we need.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

We don't need the COT to be a perfect mirror of the model's internal computations.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

We just need it to be useful for catching bad behavior.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

The question isn't does this COT correspond to some platonic internal reasoning process.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

It's simpler, if the model is scheming, will evidence of that show up in the COT?