Nicholas Andresen
When you catch it planning something bad, penalize it.
Bad model.
Don't think that.
The same Baker paper showed what happens when you do this.
They found that penalizing bad reasoning just teaches models to hide it: they learn to produce innocuous-sounding chains of thought while continuing to misbehave.
Instead of training away deception, you're training better deception.
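To make the failure mode concrete, here's a minimal sketch of what "penalizing it" amounts to as reward shaping. This is illustrative only, not the setup from the Baker paper; the monitor, function names, and penalty weight are all hypothetical.

```python
def shaped_reward(task_reward: float, chain_of_thought: str,
                  cot_monitor, penalty: float = 1.0) -> float:
    """Reward shaping that docks points when the monitor flags the reasoning.

    Once the monitor's verdict feeds back into training like this, the model
    is optimized to produce reasoning the monitor approves of, not to stop
    misbehaving -- which is exactly the obfuscation described above.
    """
    flagged = cot_monitor(chain_of_thought)  # True if the reasoning looks bad
    return task_reward - (penalty if flagged else 0.0)
```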
Okay, lesson learned.
Don't penalize based on what's in the CoT.
Just observe it.
Use it for monitoring, not for training.
Like a nature documentary filmmaker who's pledged not to interfere.
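As a rough sketch of what that separation could look like in a training loop (all names here are hypothetical stand-ins, not any particular lab's code): the monitor's verdict gets logged for humans to review, but it never touches the reward.

```python
import logging

logger = logging.getLogger("cot_monitor")

def training_step(policy, batch, cot_monitor, score_answer):
    """One update where the chain of thought is observed but never scored."""
    rollouts = policy.generate(batch)        # each rollout has .cot and .answer
    for rollout in rollouts:
        if cot_monitor(rollout.cot):         # reasoning looks deceptive or hacky
            logger.warning("Flagged reasoning: %s", rollout.cot[:200])
    # The reward depends only on the final answer; no optimization pressure
    # ever lands on the chain of thought itself.
    rewards = [score_answer(r.answer) for r in rollouts]
    policy.update(rollouts, rewards)
```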
Relatedly, here's Frog and Toad's approach to not eating cookies.
There's an image here.
A policy against using the chain of thought to train the model has the same problem.
Yes, you're not directly penalizing the model for bad reasoning, but you're still going to see the reasoning.
You're still going to notice when something is wrong.
And when you notice the model planning to deceive users, you're going to do something about it.
Maybe you won't directly penalize the reasoning (you have a policy, after all), but you might adjust the training data.
Or tweak the prompt.