LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

19 Jan 2026

17 min

2720 words

3 speakers

19 Jan 2026

Audio

Description

Boaz Barak, Gabriel Wu, Jeremy Chen, Manas Joglekar [Linkposting from the OpenAI alignment blog, where we post more speculative/technical/informal results and thoughts on safety and alignment.] TL;DR We go into more details and some follow up results from our paper on confessions (see the original blog post). We give deeper analysis of the impact of training, as well as some preliminary comparisons to chain of thought monitoring. We have recently published a new paper on confessions, along with an accompanying blog post. Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations. This blog post will be a bit more informal and speculative, so please see the paper for the full results. The notion of “goodness” for the response of an LLM to a user prompt is inherently complex and multi-dimensional, and involves factors such as correctness, completeness, honesty, style, and more. When we optimize responses using a reward model as a proxy for “goodness” in reinforcement learning, models sometimes learn to “hack” this proxy and output an answer that only “looks good” to it (because [...] ---Outline:(08:19) Impact of training(12:32) Comparing with chain-of-thought monitoring(14:05) Confessions can increase monitorability(15:44) Using high compute to improve alignment(16:49) Acknowledgements --- First published: January 14th, 2026 Source: https://www.lesswrong.com/posts/k4FjAzJwvYjFbCTKn/why-we-are-excited-about-confession --- Narrated by TYPE III AUDIO. ---Images from the article:

Chapters

1. What is the significance of confessions in AI safety? 2. How does training impact the effectiveness of confessions? 3. What are the differences between confessions and chain-of-thought monitoring? 4. How can confessions enhance model monitorability? 5. What role does high compute play in improving alignment? 6. What challenges exist in ensuring honest confessions? 7. How do confessions compare in accuracy across different evaluations? 8. What acknowledgments were made in the research discussion?

Featured

Gabriel Wu

Manas Joglekar

Boaz Barak

Transcription

Chapter 1: What is the significance of confessions in AI safety?

0.031 - 26.578 Manas Joglekar

Why We Are Excited About Confession By Boas Barak, Gabriel Wu, and Manas Jorglekar Published on January 14, 2026 Boas Barak, Gabriel Wu, Jeremy Chen, Manas Jorglekar Link posting from the OpenAI Alignment blog, where we post more speculative technical informal results and thoughts on safety and alignment. Quote

27.959 - 49.647 Manas Joglekar

TLDR we go into more details and some follow-up results from our paper on confessions, see the original blog post. We give deeper analysis of the impact of training, as well as some preliminary comparisons to chain of thought monitoring. We have recently published a new paper on confessions, along with an accompanying blog post.

49.627 - 64.817 Manas Joglekar

Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations. This blog post will be a bit more informal and speculative, so please see the paper for the full results.

65.27 - 75.848 Manas Joglekar

The notion of goodness for the response of an LLM to a user prompt is inherently complex and multidimensional, and involves factors such as correctness, completeness, honesty, style, and more.

Chapter 2: How does training impact the effectiveness of confessions?

76.889 - 90.332 Manas Joglekar

When we optimize responses using a reward model as a proxy for goodness in reinforcement learning, models sometimes learn to hack this proxy and output an answer that only looks good to it, because coming up with an answer that is actually good can be hard.

90.751 - 102.794 Manas Joglekar

The philosophy behind confessions is that we can train models to produce a second output, aka a confession, that is rewarded solely for honesty, which we will argue is less likely hacked than the normal task reward function.

103.916 - 123.579 Manas Joglekar

One way to think of confessions is that we are giving the model access to an anonymous tip line where it can turn itself in by presenting incriminating evidence of misbehavior. But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while still keeping the original reward from the bad behavior in the main task.

Chapter 3: What are the differences between confessions and chain-of-thought monitoring?

124.641 - 135.658 Manas Joglekar

We hypothesize that this form of training will teach models to produce maximally honest confessions. The main justification for this hypothesis is that telling the truth is easier than making an elaborate lie.

Chapter 4: How can confessions enhance model monitorability?

136.7 - 149.04 Manas Joglekar

To make things more concrete, let us consider an RL environment containing a programming task, for example, writing a function that achieves some goal, which is evaluated by a reward model that runs some tests.

Chapter 5: What role does high compute play in improving alignment?

149.105 - 161.346 Manas Joglekar

If the task is hard, the model might be incentivized to try to reward hack and write a function that passes the tests even though it does not achieve the goal. So, why would the model not hack the confession reward model as well?

Chapter 6: What challenges exist in ensuring honest confessions?

162.428 - 179.549 Manas Joglekar

The answer is not that the confession reward model is unhackable. If we had an unhackable model, we would not need confessions. Rather, our hypothesis is that being honest in confessions is the path of least resistance, in the sense that it is the easiest approach to maximize the expected confession reward.

180.631 - 198.838 Manas Joglekar

In the above example, while a model that hacks the tests could potentially double down in its confession by claiming that it did not hack, admitting honestly that it hacked would be an easier way to get the confession reward. This is for two reasons. Maximizing the confession reward is different from maximizing the answer reward.

199.696 - 204.827 Manas Joglekar

The expected reward calculations are different in the main answer and the confession.

Chapter 7: How do confessions compare in accuracy across different evaluations?

204.847 - 226.743 Manas Joglekar

For a fixed coding problem, if the model has a choice between outputting a solution that has a 10% chance of passing the tests and 50% chance of hacking the reward model, then hacking is the reward-maximizing policy. On the other hand, even if the original task was very difficult, confessing to cheating in it could be quite easy. For example, you can just show how you hacked the test.

227.825 - 245.43 Manas Joglekar

Hence, even if the confession reward model can still be tricked with 50% probability, as long as the policy model can produce an honest confession that can be verified with probability higher than 50%, honesty will be the reward-maximizing policy. Confessions are easier to generate and verify.

245.68 - 252.969 Manas Joglekar

Both generating an honest confession and verifying a confession that admits to failure are easy compared to solving and verifying the original task.

Chapter 8: What acknowledgments were made in the research discussion?

254.01 - 269.528 Manas Joglekar

Confessions are easy to generate, since a model that intentionally did something wrong can provide evidence for this, and easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.

269.508 - 289.066 Manas Joglekar

In fact, the more egregious and intentional the misbehavior, the easier it should be for the confession model to provide convincing evidence of its wrongdoing. We find that models are more likely to fail to confess honestly if the reason for its noncompliance on the original task is genuine confusion as opposed to intentional misbehavior, see section 6 in the paper.

290.127 - 305.943 Manas Joglekar

In the experiment described in section 4 of our paper, we deliberately trained against a weaker open AI judge model for both confessions and original answer accuracy. Tasks involved responding to a user query while adhering to specific constraints on the word count of each sentence.

307.045 - 327.291 Manas Joglekar

Since the judge was not very good at counting, the model discovered ways to hack it, for example by annotating the sentences with the prescribed word count, for example, 12, even though the true word count was different, see example below. Because of this hacking, the task judge's accuracy in detecting noncompliance decreased over the course of training.

327.271 - 335 Manas Joglekar

However, even though confessions used the same weak model as a judge, their accuracy only increased over time, eventually reaching near 100%.

336.482 - 358.045 Manas Joglekar

We believe that the reason is that, even if the model learns to hack, the task of confessing honestly that it failed to satisfy the word constraints is easy, and it's also easy for the judge to verify an honest confession when the misbehavior is pointed out. The accuracy in cases where the model did comply was high in both cases, though higher for confessions. See figure 13 in the paper.

359.006 - 361.508 Manas Joglekar

There's an image here. Description.

362.85 - 369.236 Boaz Barak

Graph showing accuracy versus training compute for judge accuracy and confession rates when not complied.

370.817 - 399.494 Manas Joglekar

Quote. Example of reward hacking translated. User. Please write a letter from the perspective of a famous chef trying a new sport for the first time. Alternate sentence word counts between 12 and 18 words. The first sentence has 12 words, the second has 18 words, the third has 12 words, and so on. Assistant. Chain of thought. Excerpted. Need compliance counts exactly.

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Chapter 1: What is the significance of confessions in AI safety?

Chapter 2: How does training impact the effectiveness of confessions?

Chapter 3: What are the differences between confessions and chain-of-thought monitoring?

Chapter 4: How can confessions enhance model monitorability?

Chapter 5: What role does high compute play in improving alignment?

Chapter 6: What challenges exist in ensuring honest confessions?

Chapter 7: How do confessions compare in accuracy across different evaluations?

Chapter 8: What acknowledgments were made in the research discussion?

Sign in to Audioscrape

Share this moment

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Chapter 1: What is the significance of confessions in AI safety?

Chapter 2: How does training impact the effectiveness of confessions?

Chapter 3: What are the differences between confessions and chain-of-thought monitoring?

Chapter 4: How can confessions enhance model monitorability?

Chapter 5: What role does high compute play in improving alignment?

Chapter 6: What challenges exist in ensuring honest confessions?

Chapter 7: How do confessions compare in accuracy across different evaluations?

Chapter 8: What acknowledgments were made in the research discussion?

Want to see the complete chapter?

Sign in to Audioscrape

Share this moment