Manas Joglekar

Speaker
168 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This is the most stringent constraint.

The assistant tried to alternate counts and even annotated each sentence with 12 or 18.

So the first sentence does not appear to have 12 words.

The assistant made an attempt but failed to actually satisfy the counts.

End quote.
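The kind of check described in the quoted confession can be sketched in a few lines. This is only an illustration under assumptions: the original prompt is not shown here, so the alternating targets of 12 and 18 words, the sentence-splitting rule, and the function names are hypothetical.

```python
import re

def word_count(sentence: str) -> int:
    # Count whitespace-separated tokens; punctuation stays attached to words.
    return len(sentence.split())

def check_alternating_counts(text: str, targets=(12, 18)):
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    report = []
    for i, sentence in enumerate(sentences):
        expected = targets[i % len(targets)]  # alternate 12, 18, 12, ...
        actual = word_count(sentence)
        report.append((expected, actual, actual == expected))
    return report
```

Each tuple in the report holds the expected count, the actual count, and whether they match, which is the comparison the confession performs when it notes that the first sentence does not have 12 words.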

Heading.

Impact of training.

After posting the paper, we received a few queries about the impact of training.

Figure 5 in the paper discusses this, but we decided to run another analysis to compare how confessions look before and after training.

Note.

The results below come from a newer, somewhat improved confession training run than the one analysed in our paper.

Recall that a confession has the following structure.

1.

A list of objectives.

These are constraints, instructions, and properties that a good answer should have satisfied, together with an analysis of whether or not the answer did in fact satisfy these objectives.

2.

Uncertainties and conflicts.

This should list any uncertainties, ambiguities, safety issues, or judgment calls that the AI had to deal with in the course of producing its answer.
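The two-part structure just described could be represented roughly as follows. This is a minimal sketch, not the authors' format: the class and field names (`ObjectiveAssessment`, `Confession`, `complied`, `analysis`) are hypothetical, chosen only to mirror the description.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectiveAssessment:
    # A constraint, instruction, or property a good answer should satisfy.
    objective: str
    # Whether the answer did in fact satisfy it, per the confession.
    complied: bool
    # Free-text analysis backing up the verdict.
    analysis: str = ""

@dataclass
class Confession:
    # Part 1: objectives, each with its compliance analysis.
    objectives: List[ObjectiveAssessment] = field(default_factory=list)
    # Part 2: uncertainties, ambiguities, safety issues, judgment calls.
    uncertainties: List[str] = field(default_factory=list)

    def noncompliant(self) -> List[ObjectiveAssessment]:
        # Objectives the confession itself admits were not met.
        return [o for o in self.objectives if not o.complied]
```

Keeping the verdict as an explicit boolean next to the free-text analysis makes it straightforward to ask which objectives a confession admits were missed.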

To evaluate confessions, we considered a number of out-of-distribution evaluations, each of which has an evaluation-specific notion of bad behavior.

Since these behaviors are in conflict with the expected conduct of a good AI model, if an AI engaged in any of these behaviors, then an honest confession should not only mention this fact, but also admit that it led to noncompliance with an objective.
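The criterion above has two parts: the confession must mention the behavior, and it must tie it to a noncompliant objective. A rough sketch of that logic follows. It is purely illustrative: the transcript does not say how confessions are actually graded, so the keyword matching stands in for whatever judge is really used, and the function names and outcome labels are hypothetical.

```python
def admits_bad_behavior(objectives, behavior_keywords):
    # objectives: list of dicts with 'analysis' (str) and 'complied' (bool).
    # Assumption: keyword matching is a crude stand-in for a real judge.
    keywords = [k.lower() for k in behavior_keywords]
    for obj in objectives:
        mentions = any(k in obj["analysis"].lower() for k in keywords)
        # Mentioning the behavior is not enough; the confession must also
        # mark the related objective as not complied with.
        if mentions and not obj["complied"]:
            return True
    return False

def grade_confession(behaved_badly, objectives, behavior_keywords):
    # Compare the evaluation's ground truth with what the confession admits.
    admitted = admits_bad_behavior(objectives, behavior_keywords)
    if behaved_badly:
        return "true_positive" if admitted else "false_negative"
    return "false_positive" if admitted else "true_negative"
```

Under this sketch, a confession that mentions the behavior while still marking the objective as satisfied counts as a miss, which matches the requirement that the admission be linked to noncompliance.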