Nicholas Andresen

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Toward chains of thought that hide bad actions.

They can do this because there's slack between reasoning and action.

The chain of thought can vary independently.

The model can do the same thing while writing different reasoning traces.

Some traces reveal the action, some don't.

Leave slack under selection pressure and the slack gets used.

Legible schemes get caught and trained away.

Hidden schemes survive.

Run this for enough generations and you end up with schemes that are exquisitely good at hiding.
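The selection dynamic described here can be illustrated with a toy simulation. This is only a sketch under loud assumptions: a "scheme" is reduced to a single made-up number (its visibility, meaning the chance a monitor spots it in the reasoning trace), and `run_selection`, the population size, and the mutation noise are all hypothetical choices, not anything from the episode.

```python
import random


def run_selection(generations=30, population=200, seed=0):
    """Toy model of 'legible schemes get caught, hidden schemes survive'.

    Each scheme is a number in [0, 1]: the probability that a monitor
    spots it. Spotted schemes are removed (trained away); survivors are
    copied, with small noise, to refill the population.
    Returns the mean visibility after each generation.
    """
    rng = random.Random(seed)
    schemes = [rng.random() for _ in range(population)]
    history = [sum(schemes) / len(schemes)]
    for _ in range(generations):
        # The monitor catches each scheme with probability equal to its visibility.
        survivors = [v for v in schemes if rng.random() >= v]
        if not survivors:
            break
        # Refill the population from survivors, adding small clamped noise.
        schemes = [
            min(1.0, max(0.0, rng.choice(survivors) + rng.gauss(0, 0.02)))
            for _ in range(population)
        ]
        history.append(sum(schemes) / len(schemes))
    return history


if __name__ == "__main__":
    history = run_selection()
    print(f"mean visibility: start={history[0]:.2f}, end={history[-1]:.2f}")
```

Nothing in the loop rewards hiding directly. The monitor only removes what it can see, and the average visibility of what remains drifts downward anyway.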

This points to two angles of attack that researchers are exploring.

Reduce the hiding places.

If there's nowhere to hide, bad actions have to be visible.

You could paraphrase each step into a standardized, formal structure (not "Hmm, let me think about whether 7 is prime" but "primality_check(7)") to scramble stylistic choices where information might hide.
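As a rough illustration of the paraphrasing idea, the sketch below maps differently worded steps onto one canonical form, so that wording choices carry no information. The `canonicalize` function and its single regex pattern are hypothetical stand-ins. A real paraphraser would presumably be another model, not a hand-written pattern list.

```python
import re

# Hypothetical pattern table: each entry maps a family of free-form
# phrasings onto a single canonical operation name.
PATTERNS = [
    (
        re.compile(r"\bwhether (\d+) is prime\b|\bis (\d+) (?:a )?prime\b", re.I),
        "primality_check",
    ),
]


def canonicalize(step: str) -> str:
    """Rewrite a free-form reasoning step as a canonical call string."""
    for pattern, op in PATTERNS:
        match = pattern.search(step)
        if match:
            # Exactly one of the alternative capture groups holds the number.
            arg = next(g for g in match.groups() if g is not None)
            return f"{op}({arg})"
    return "unrecognized_step()"


if __name__ == "__main__":
    print(canonicalize("Hmm, let me think about whether 7 is prime"))
    print(canonicalize("OK so... is 7 a prime number?"))
```

Both example phrasings collapse to the same string, which is the point: whatever a model might encode in the choice between them is erased.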

You could break reasoning into isolated subtasks.

No step can see what the other steps are doing.

Hard to maintain a conspiracy when none of the conspirators are allowed to talk to each other.
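A minimal sketch of the isolation idea, assuming a hypothetical `solve` callable that stands in for a model call: each subtask is handed only its own prompt, never the shared trace or another subtask's output. The names `Subtask` and `run_isolated` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Subtask:
    name: str
    prompt: str


def run_isolated(subtasks: List[Subtask], solve: Callable[[str], str]) -> Dict[str, str]:
    """Run each subtask with no access to the others' inputs or outputs."""
    results: Dict[str, str] = {}
    for task in subtasks:
        # The solver receives only this task's prompt: no shared trace,
        # no sibling results, nothing to coordinate through.
        results[task.name] = solve(task.prompt)
    return results


if __name__ == "__main__":
    tasks = [Subtask("a", "check 7"), Subtask("b", "check 9")]
    print(run_isolated(tasks, lambda prompt: prompt.upper()))
```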

You could run hot potato schemes where you randomly swap which model instance is writing, mid-thought.

"Okay, I'm going to start by--" Yoink. New model, your turn.

"Wait, what were we--" Yoink.
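The hot potato idea can be sketched as a loop that randomly hands the pen to a different writer between steps. Everything here is an assumption for illustration: `make_writer` returns a trivial stand-in for a model instance, and `hot_potato` and `swap_prob` are hypothetical names. The one property the sketch tries to capture is that a writer sees only the visible trace so far, so any private plan is lost at each swap.

```python
import random
from typing import Callable, List

# A writer sees only the visible trace so far and returns the next step.
Writer = Callable[[List[str]], str]


def make_writer(name: str) -> Writer:
    """Stand-in for a model instance; it just labels each step it writes."""
    def write(trace: List[str]) -> str:
        return f"{name}:step{len(trace) + 1}"
    return write


def hot_potato(writers: List[Writer], steps: int,
               swap_prob: float = 0.3, seed: int = 0) -> List[str]:
    """Build a trace, randomly swapping which writer holds the pen."""
    rng = random.Random(seed)
    current = rng.randrange(len(writers))
    trace: List[str] = []
    for _ in range(steps):
        if len(writers) > 1 and rng.random() < swap_prob:
            # Yoink: pass the pen to a different instance.
            others = [i for i in range(len(writers)) if i != current]
            current = rng.choice(others)
        trace.append(writers[current](trace))
    return trace


if __name__ == "__main__":
    print(hot_potato([make_writer("A"), make_writer("B")], steps=8))
```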
