Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Nicholas Andresen

๐Ÿ‘ค Speaker
498 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You're not training on COT, you followed your policy.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

but you acted on what you saw.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You've now created a world where problems you can see get fixed and problems you can't see don't.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Do that for long enough and what do you get?

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Models whose problems don't show up in the reasoning trace.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

The same thing happens if the COT influences anything you do train on.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Suppose you show users summaries of the model's reasoning and let them rate which response they prefer.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Users like confidence and clarity.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

They don't like him.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Let me consider the unethical option before I explain why I'm rejecting it.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

That's unsettling.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

If the summary influences their rating, and why show it otherwise?

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You're selecting for reasoning that looks reassuring rather than reasoning that's informative.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You ate the cookie.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

At this point you might try a new plan.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Total quarantine.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

The reasoning trace is sacred and untouchable.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Nobody trains on it.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Nobody acts on it.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You hire a guy whose only job is to yell don't look at the reasoning but anyone who gets too curious.