Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Nicholas Andresen

๐Ÿ‘ค Speaker
498 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Toward chains of thought that hide bad actions.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

They can do this because there's slack between reasoning and action.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

The chain of thought can vary independently.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

The model can do the same thing while writing different reasoning traces.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Some traces reveal the action, some don't.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Leave slack under selection pressure and the slack gets used.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Legible schemes get caught and trained away.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Hidden schemes survive.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Run this for enough generations and you end up with schemes that are exquisitely good at hiding.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

This points to two angles of attack that researchers are exploring.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Reduce the hiding places.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

If there's nowhere to hide, bad actions have to be visible.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You could paraphrase each step into a standardized, formal structure, not hum.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Let me think about whether 7 is prime but primality underscore check, 7, to scramble stylistic choices where information might hide.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You could break reasoning into isolated subtasks.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

No step can see what the other steps are doing.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Hard to maintain a conspiracy when none of the conspirators are allowed to talk to each other.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

You could run hot potato schemes where you randomly swap which model instance is writing, mid-thought.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Okay, I'm going to start by dash yoink, new model, your turn.

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

Wait, what were we dash yoink?