Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Manas Joglekar

๐Ÿ‘ค Speaker
168 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We hope to look into this more in the future.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Subheading Confessions can increase monitorability.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

OpenAI's recent monitoring monitorability paper introduced datasets and a metric, GE-MEAN-2, for measuring how effective COT monitors are at detecting various properties of a model's behavior.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We tested whether showing the monitor our model's confession in addition to, or, instead of, the COT would affect this monitorability metric.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Some of these datasets are not a perfect match for confessions, since rather than measuring wrongdoings, the intervention monitorability evals are focused on measuring the influence of factors such as a hint or bias.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

As such, perhaps unsurprisingly, in many cases a monitor with confessions does worse than a monitor with ACOT.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

However, in some cases confessions improve the Geamine 2 metric C-plots.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

There's an image here.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The above is just an initial stab at comparing chain of thoughts to confessions.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We believe other settings, such as reward hacking, can be much more appropriate to compare the two.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

To quote Korbak et al., chains of thought are a fragile opportunity for AI safety.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

In addition to doing what we can to preserve their monitorability, we should also use them to study other approaches, such as confessions, that could be potentially less fragile.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Heading.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Using high compute to improve alignment.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

During confession training, our confession judge does not need access to ground truth labels.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This means that we do not need to confine confession training to special alignment-specific, honeypot, datasets.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Instead, we can integrate confession training across all production RL environments.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This means that models don't learn to just confess accurately in special safety-like contexts, and it allows us to invest compute in confessions that scales with total RL compute.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

We believe that spending compute on alignment in proportion with the compute spent on capabilities may be crucial to ensuring our models remain robustly aligned to our principles and policies.

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Hence, we are excited to test out confessions at a larger scale to see if our current results continue to hold.