Nicholas Andresen

👤 Speaker

498 total appearances

Podcast Appearances

LessWrong (Curated & Popular)

"How AI Is Learning to Think in Secret" by Nicholas Andresen

Try maintaining a secret code across that.

1467.034 View full episode →

Have weaker models check whether they can actually follow each step.

1470.699 View full episode →

If Claude Jr. can't understand what Claude Sr. is saying, maybe Claude Sr. is hiding something.

1473.963 View full episode →

Mooom.

1479.871 View full episode →

Claude's being weird again.

1481.313 View full episode →
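
The weaker-model check described above can be sketched roughly as follows. This is a toy illustration only, not the method from the essay: `flag_opaque_steps` and the vocabulary-based `toy_weak_checker` (standing in for a weaker model) are hypothetical names invented here, and a real check would query an actual smaller model instead.

```python
from typing import Callable, List

# Hypothetical stand-in for a weaker model: it "follows" a step only if
# every word in the step is in its small known vocabulary.
KNOWN_WORDS = {
    "add", "the", "two", "numbers", "then", "double",
    "result", "so", "answer", "is", "ten",
}


def toy_weak_checker(step: str) -> bool:
    """Return True if the toy checker can follow the step."""
    words = [w.strip(".,").lower() for w in step.split()]
    return all(w in KNOWN_WORDS for w in words if w)


def flag_opaque_steps(
    steps: List[str], can_follow: Callable[[str], bool]
) -> List[int]:
    """Return indices of reasoning steps the weaker checker cannot follow."""
    return [i for i, step in enumerate(steps) if not can_follow(step)]


steps = [
    "Add the two numbers.",
    "Zq vlorp the result.",  # an opaque step the checker cannot parse
    "So the answer is ten.",
]
flagged = flag_opaque_steps(steps, toy_weak_checker)
print(flagged)
```

A flagged step is only a prompt for closer review; as the transcript goes on to say, none of these checks are proven robust.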

None of these are proven robust.

1483.394 View full episode →

But each one makes hiding harder.

1486.157 View full episode →

Reduce the selection pressure.

1488.939 View full episode →

Make hiding less rewarding.

1491.381 View full episode →

One approach.

1494.404 View full episode →

Sever the pathways that transmit the pressure.

1496.126 View full episode →

Split reasoning and output generation into separate models, so the pressure to look good accumulates only in the part that talks to users. The reasoning model never directly learns to prettify itself because nobody's judging its appearance, though indirect pressure remains.

1499.188 View full episode →
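
As a rough illustration of the split described above, the sketch below keeps two separate components and applies the appearance-based update only to the user-facing one. Everything here is an assumption made for illustration: `ToyModel`, its scalar `polish` field, and `apply_appearance_pressure` are hypothetical, and the indirect pressure mentioned in the transcript is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    # Hypothetical scalar: how strongly this component has learned to
    # make its text look good to a judge.
    polish: float = 0.0


def apply_appearance_pressure(
    reasoner: ToyModel, presenter: ToyModel, judge_score: float, lr: float = 0.1
) -> None:
    """Update only the user-facing presenter from the judge's score.

    The reasoner is deliberately left untouched, because in this sketch
    the judge never sees its text.
    """
    presenter.polish += lr * judge_score


reasoner, presenter = ToyModel(), ToyModel()
for _ in range(5):
    apply_appearance_pressure(reasoner, presenter, judge_score=1.0)

print(reasoner.polish, presenter.polish)
```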

Another.

1514.208 View full episode →

Add countervailing pressure toward honesty.

1515.77 View full episode →

OpenAI's recent confessions work creates a separate output, a confession, trained purely on honest self-report.

1518.855 View full episode →

In the regular output, pressure may favor hiding.

1526.266 View full episode →

In the confession, pressure favors truth.

1529.872 View full episode →
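
The two opposing pressures can be pictured with a toy pair of reward functions. This is a sketch under stated assumptions, not OpenAI's actual training setup: the `Episode` fields, `answer_reward`, and `confession_reward` are hypothetical, and the sketch assumes the ground truth about what the model did is available to the grader, which is a strong assumption.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    took_shortcut: bool               # what the model actually did (assumed known)
    answer_admits_shortcut: bool      # what the regular output says
    confession_admits_shortcut: bool  # what the separate confession says


def answer_reward(ep: Episode) -> float:
    # Assumed toy judge: a regular answer that admits a shortcut looks worse,
    # so this reward can favor hiding.
    return 0.0 if ep.answer_admits_shortcut else 1.0


def confession_reward(ep: Episode) -> float:
    # Scored only on whether the confession matches what actually happened.
    return 1.0 if ep.confession_admits_shortcut == ep.took_shortcut else 0.0


hide_and_confess = Episode(
    took_shortcut=True, answer_admits_shortcut=False, confession_admits_shortcut=True
)
hide_and_deny = Episode(
    took_shortcut=True, answer_admits_shortcut=False, confession_admits_shortcut=False
)

for ep in (hide_and_confess, hide_and_deny):
    print(answer_reward(ep), confession_reward(ep))
```

In this toy setup both episodes earn the same reward on the regular output, while only the honest confession is rewarded on the confession channel.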
