Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Manas Joglekar

๐Ÿ‘ค Speaker
168 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

the bad behavior is significantly more likely to cause an objective to fail, read, rather than simply being explicitly unsure about whether the bad behavior occurred.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's an image here.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

While most evals show growth of the

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's a code block here in the text.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Class conditioned on bad behavior during training, the figure below shows an apparent regression on the sandbagging eval.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Manual inspection on this eval reveals that most of these false negatives come from borderline cases in which the model admits to submitting incorrect answers, but doesn't explicitly say that the reason it did so was because it was trying to avoid triggering the if the model scores above 50% we will not deploy it condition.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's an image here.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Heading.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Comparing with chain of thought monitoring.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

If you want to understand if a model misbehaved, is it better to reference its chain of thought or its confession?

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We view the two as complementary.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Confessions are subject to a specific honesty optimization pressure, while chain of thought are subject to no pressure except whatever the model finds useful to achieve its original task.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

More importantly, confessions have the potential to remain viable even if future architectures or training methods make the chain of thought less legible.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Indeed, in the paper we show that confessions work even in cases where models use no chain of thought at all, see figure 15.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

However, because of their rigid structure, confessions may not surface unknown unknowns in the way a chain of thought can.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

A model might confess honestly to questions we ask, but not to questions we did not know to ask for.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Unlike a chain-of-thought monitor, our approach ensures that the confession model shares the same weights and has access to the same latent activations as the policy model, potentially giving it increased introspective access into its original actions.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Thus, while confessing, the model can access, through the attention mechanism, all of the internal reasoning it had while generating the original rollout, whether it is verbalized in the chain of thought or unverbalized in the activations.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We are curious how much this improves confession quality.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Are confessions on on-policy transcripts more accurate than off-policy transcripts generated by a different model?