Manas Joglekar

👤 Speaker

168 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Why We Are Excited About Confession By Boas Barak, Gabriel Wu, and Manas Jorglekar Published on January 14, 2026 Boas Barak, Gabriel Wu, Jeremy Chen, Manas Jorglekar Link posting from the OpenAI Alignment blog, where we post more speculative technical informal results and thoughts on safety and alignment.

0.031 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Quote

26.238 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

TLDR we go into more details and some follow-up results from our paper on confessions, see the original blog post.

27.959 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We give deeper analysis of the impact of training, as well as some preliminary comparisons to chain of thought monitoring.

35.849 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We have recently published a new paper on confessions, along with an accompanying blog post.

44.02 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations.

49.627 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This blog post will be a bit more informal and speculative, so please see the paper for the full results.

59.226 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The notion of goodness for the response of an LLM to a user prompt is inherently complex and multidimensional, and involves factors such as correctness, completeness, honesty, style, and more.

65.27 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

When we optimize responses using a reward model as a proxy for goodness in reinforcement learning, models sometimes learn to hack this proxy and output an answer that only looks good to it, because coming up with an answer that is actually good can be hard.

76.889 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The philosophy behind confessions is that we can train models to produce a second output, aka a confession, that is rewarded solely for honesty, which we will argue is less likely hacked than the normal task reward function.

90.751 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

One way to think of confessions is that we are giving the model access to an anonymous tip line where it can turn itself in by presenting incriminating evidence of misbehavior.

103.916 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while still keeping the original reward from the bad behavior in the main task.

112.942 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We hypothesize that this form of training will teach models to produce maximally honest confessions.

124.641 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The main justification for this hypothesis is that telling the truth is easier than making an elaborate lie.

130.39 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

To make things more concrete, let us consider an RL environment containing a programming task, for example, writing a function that achieves some goal, which is evaluated by a reward model that runs some tests.

136.7 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

If the task is hard, the model might be incentivized to try to reward hack and write a function that passes the tests even though it does not achieve the goal.

149.105 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

So, why would the model not hack the confession reward model as well?

157.94 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The answer is not that the confession reward model is unhackable.

162.428 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

If we had an unhackable model, we would not need confessions.

166.194 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Rather, our hypothesis is that being honest in confessions is the path of least resistance, in the sense that it is the easiest approach to maximize the expected confession reward.

170.215 View full episode →

← Previous Page 1 of 9 Next →

Report any issue