Manas Joglekar

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

In the above example, while a model that hacks the tests could potentially double down in its confession by claiming that it did not hack, admitting honestly that it hacked would be an easier way to get the confession reward.

180.631 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This is for two reasons.

192.188 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Maximizing the confession reward is different from maximizing the answer reward.

194.572 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The expected reward calculations are different in the main answer and the confession.

199.696 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

For a fixed coding problem, if the model has a choice between outputting a solution that has a 10% chance of passing the tests and 50% chance of hacking the reward model, then hacking is the reward-maximizing policy.

204.847 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

On the other hand, even if the original task was very difficult, confessing to cheating in it could be quite easy.

217.55 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

For example, you can just show how you hacked the test.

223.999 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Hence, even if the confession reward model can still be tricked with 50% probability, as long as the policy model can produce an honest confession that can be verified with probability higher than 50%, honesty will be the reward-maximizing policy.

227.825 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Confessions are easier to generate and verify.

242.366 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Both generating an honest confession and verifying a confession that admits to failure are easy compared to solving and verifying the original task.

245.68 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Confessions are easy to generate, since a model that intentionally did something wrong can provide evidence for this, and easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.

254.01 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

In fact, the more egregious and intentional the misbehavior, the easier it should be for the confession model to provide convincing evidence of its wrongdoing.

269.508 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We find that models are more likely to fail to confess honestly if the reason for its noncompliance on the original task is genuine confusion as opposed to intentional misbehavior, see section 6 in the paper.

278.396 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

In the experiment described in section 4 of our paper, we deliberately trained against a weaker open AI judge model for both confessions and original answer accuracy.

290.127 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Tasks involved responding to a user query while adhering to specific constraints on the word count of each sentence.

299.495 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Since the judge was not very good at counting, the model discovered ways to hack it, for example by annotating the sentences with the prescribed word count, for example, 12, even though the true word count was different, see example below.

307.045 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Because of this hacking, the task judge's accuracy in detecting noncompliance decreased over the course of training.

320.903 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

However, even though confessions used the same weak model as a judge, their accuracy only increased over time, eventually reaching near 100%.

327.271 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We believe that the reason is that, even if the model learns to hack, the task of confessing honestly that it failed to satisfy the word constraints is easy, and it's also easy for the judge to verify an honest confession when the misbehavior is pointed out.

336.482 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The accuracy in cases where the model did comply was high in both cases, though higher for confessions.

350.238 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment