Nicholas Andresen

Speaker
498 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"How AI Is Learning to Think in Secret" by Nicholas Andresen

It's the AI equivalent of "I won't be mad, I just want to know what happened to the vase."

This is how you don't eat the cookie.

Not by trying to keep the box closed.

What you need to do is not keep cookies in the house.

Or, if you must have cookies, bake them with anchovies.

Make them taste so bad you don't want them anymore.

But not eating the cookie only addresses selection pressures.

The pressures that come from us.

Situational awareness creates pressure that comes from the model itself.

Closing hiding places helps, but a sufficiently smart model that's actively looking will find gaps we missed.

At that point, what matters is whether the model actually wants to be honest.

Values good enough to survive the realization that honesty is optional. Anthropic's soul document, for example, is one attempt at instilling these.

And maybe whether you're the kind of person a model would want to be honest with.

The hope is that monitorable CoT buys you enough time to figure this out.

But even if chain of thought stays monitorable, there's a worse problem waiting.

Everything we've talked about so far assumes the model reasons in text.

But consider: how do people actually think?

When Magnus Carlsen plays chess, he doesn't narrate to himself, "OK, I'm considering moving the knight because I'm somewhat concerned about the diagonal and I'm experiencing a moderate sense of pressure on the queenside."

No, he looks at the board and knows.