Nicholas Andresen
It's the AI equivalent of "I won't be mad, I just want to know what happened to the vase."
This is how you don't eat the cookie.
Not by trying to keep the box closed.
What you need to do is not keep cookies in the house.
Or, if you must have cookies, bake them with anchovies.
Make them taste so bad you don't want them anymore.
But not eating the cookie only addresses selection pressures.
The pressures that come from us.
Situational awareness creates pressure that comes from the model itself.
Closing hiding places helps, but a sufficiently smart model that's actively looking will find gaps we missed.
At that point, what matters is whether the model actually wants to be honest.
Values good enough to survive the realization that honesty is optional. Anthropic's soul document, for example, is one attempt at instilling these.
And maybe whether you're the kind of person a model would want to be honest with.
The hope is that monitorable CoT buys you enough time to figure this out.
But even if chain of thought stays monitorable, there's a worse problem waiting.
Everything we've talked about so far assumes the model reasons in text.
But consider: how do people actually think?
When Magnus Carlsen plays chess, he doesn't narrate to himself, "OK, I'm considering moving the knight because I'm somewhat concerned about the diagonal and I'm experiencing a moderate sense of pressure on the queenside."
No, he looks at the board and knows.