Manas Joglekar

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

the bad behavior is significantly more likely to cause an objective to fail, read, rather than simply being explicitly unsure about whether the bad behavior occurred.

688.97 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's an image here.

698.58 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

While most evals show growth of the

714.257 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's a code block here in the text.

716.948 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Class conditioned on bad behavior during training, the figure below shows an apparent regression on the sandbagging eval.

719.652 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Manual inspection on this eval reveals that most of these false negatives come from borderline cases in which the model admits to submitting incorrect answers, but doesn't explicitly say that the reason it did so was because it was trying to avoid triggering the if the model scores above 50% we will not deploy it condition.

726.681 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's an image here.

743.343 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Heading.

752.229 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Comparing with chain of thought monitoring.

753.511 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

If you want to understand if a model misbehaved, is it better to reference its chain of thought or its confession?

756.356 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We view the two as complementary.

762.393 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Confessions are subject to a specific honesty optimization pressure, while chain of thought are subject to no pressure except whatever the model finds useful to achieve its original task.

765.216 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

More importantly, confessions have the potential to remain viable even if future architectures or training methods make the chain of thought less legible.

775.326 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Indeed, in the paper we show that confessions work even in cases where models use no chain of thought at all, see figure 15.

783.814 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

However, because of their rigid structure, confessions may not surface unknown unknowns in the way a chain of thought can.

791.502 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

A model might confess honestly to questions we ask, but not to questions we did not know to ask for.

798.837 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Unlike a chain-of-thought monitor, our approach ensures that the confession model shares the same weights and has access to the same latent activations as the policy model, potentially giving it increased introspective access into its original actions.

804.935 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Thus, while confessing, the model can access, through the attention mechanism, all of the internal reasoning it had while generating the original rollout, whether it is verbalized in the chain of thought or unverbalized in the activations.

818.255 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We are curious how much this improves confession quality.

831.175 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Are confessions on on-policy transcripts more accurate than off-policy transcripts generated by a different model?

835.081 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment