Nicholas Andresen
Toward chains of thought that hide bad actions.
They can do this because there's slack between reasoning and action.
The chain of thought can vary independently.
The model can do the same thing while writing different reasoning traces.
Some traces reveal the action, some don't.
Leave slack under selection pressure and the slack gets used.
Legible schemes get caught and trained away.
Hidden schemes survive.
Run this for enough generations and you end up with schemes that are exquisitely good at hiding.
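To make that selection story concrete, here is a toy simulation. It is only a sketch under loud assumptions: each scheme is reduced to a single made-up number, its "legibility" (the chance a monitor catches it), and `simulate`, the population size, and the mutation noise are all hypothetical choices, not anything measured from real training.

```python
import random


def simulate(generations=30, pop_size=200, seed=0):
    """Toy model: schemes that get caught are removed, survivors are copied."""
    rng = random.Random(seed)
    # Assumption: a scheme is summarized by one number in [0, 1],
    # the probability that its reasoning trace reveals it.
    population = [rng.random() for _ in range(pop_size)]
    history = [sum(population) / pop_size]  # mean legibility per generation

    for _ in range(generations):
        # The monitor catches each scheme with probability equal to its legibility.
        survivors = [p for p in population if rng.random() > p]
        if not survivors:
            break  # everything was caught; nothing left to copy
        # Refill the population with slightly varied copies of the survivors.
        population = [
            min(1.0, max(0.0, rng.choice(survivors) + rng.gauss(0, 0.02)))
            for _ in range(pop_size)
        ]
        history.append(sum(population) / pop_size)

    return history


if __name__ == "__main__":
    history = simulate()
    print(f"mean legibility: start={history[0]:.2f}, end={history[-1]:.2f}")
```

Nothing in the loop rewards hiding directly. Removing whatever gets caught is enough to push the average legibility down.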
This points to two angles of attack that researchers are exploring.
Reduce the hiding places.
If there's nowhere to hide, bad actions have to be visible.
You could paraphrase each step into a standardized, formal structure (not "Hmm, let me think about whether 7 is prime" but PRIMALITY_CHECK(7)) to scramble stylistic choices where information might hide.
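As a rough illustration of the paraphrasing idea, the sketch below maps differently worded versions of the same step onto one canonical form. The function name `canonicalize`, the regular expression, and the `UNPARSED` fallback are all assumptions made for this example; a real paraphraser would presumably be another model, not a regex.

```python
import re

# Hypothetical pattern: any phrasing that asks whether some integer is prime.
_PRIME_QUESTION = re.compile(r"\b(\d+)\s+is\s+(?:a\s+)?prime\b", re.IGNORECASE)


def canonicalize(step: str) -> str:
    """Rewrite a free-form reasoning step into a fixed, formal structure."""
    match = _PRIME_QUESTION.search(step)
    if match:
        # Word choice, punctuation, and tone are all discarded here,
        # so they cannot carry extra information forward.
        return f"PRIMALITY_CHECK({match.group(1)})"
    # Assumption: steps we cannot parse are flagged rather than passed through.
    return "UNPARSED"


if __name__ == "__main__":
    print(canonicalize("Hmm, let me think about whether 7 is prime"))
    print(canonicalize("OK so... I wonder whether 7 is a prime number?"))
```

Both phrasings collapse to the same output, which is the point: whatever the stylistic differences were carrying does not survive the rewrite.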
You could break reasoning into isolated subtasks.
No step can see what the other steps are doing.
Hard to maintain a conspiracy when none of the conspirators are allowed to talk to each other.
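One way to picture the isolation is a pipeline in which each step hands on only its answer, while its reasoning goes to a monitor and nowhere else. This is a minimal sketch, assuming a step can be modeled as a function returning an answer and a reasoning trace; `run_isolated` and the two toy steps are hypothetical names for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a step takes its input and returns (answer, reasoning trace).
Step = Callable[[str], Tuple[str, str]]


def run_isolated(steps: Sequence[Step], task_input: str) -> Tuple[str, List[str]]:
    """Run steps in order; each one sees only the previous step's answer."""
    current = task_input
    traces: List[str] = []
    for step in steps:
        answer, reasoning = step(current)
        traces.append(reasoning)  # kept for the monitor, never shown to later steps
        current = answer          # only the answer moves forward
    return current, traces


def double(x: str) -> Tuple[str, str]:
    return str(int(x) * 2), "doubling the input"


def add_one(x: str) -> Tuple[str, str]:
    return str(int(x) + 1), "adding one"


if __name__ == "__main__":
    result, traces = run_isolated([double, add_one], "3")
    print(result, traces)
```

In this toy version, the answer is still a channel between steps, so in practice it would need to be constrained or paraphrased as well.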
You could run hot potato schemes where you randomly swap which model instance is writing, mid-thought.
"Okay, I'm going to start by--" Yoink. New model, your turn.
"Wait, what were we--" Yoink.
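A hot potato loop could look something like the sketch below. It assumes each model instance can be treated as a function from the visible text so far to the next chunk; `hot_potato`, the `Writer` type, and the toy writers are hypothetical stand-ins, not a real model API.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumption: a writer sees only the visible text so far and returns the next chunk.
Writer = Callable[[str], str]


def hot_potato(
    writers: Sequence[Writer], prompt: str, n_chunks: int, seed: int = 0
) -> Tuple[str, List[int]]:
    """Build one reasoning trace, picking a random writer for every chunk."""
    rng = random.Random(seed)
    text = prompt
    authors: List[int] = []
    for _ in range(n_chunks):
        idx = rng.randrange(len(writers))  # yoink: whoever is picked writes next
        text += writers[idx](text)         # the writer gets the text and nothing else
        authors.append(idx)
    return text, authors


if __name__ == "__main__":
    writers = [lambda t: "a", lambda t: "b"]
    text, authors = hot_potato(writers, "", 10, seed=1)
    print(text, authors)
```

Because no writer knows whether it will write the next chunk, any plan it wants to continue has to be spelled out in the visible text where the next writer, and the monitor, can read it.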