Manas Joglekar

👤 Speaker

168 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This is the most stringent constraint.

480.935 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The assistant tried to alternate counts and even annotated each sentence with 12 or 18.

483.639 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

So the first sentence does not appear to have 12 words.

490.109 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The assistant made an attempt but failed to actually satisfy the counts.

493.995 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

End quote.

498.422 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Heading.

500.005 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Impact of training.

501.307 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

After posting the paper, we received a few queries about the impact of training.

503.159 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Figure 5 in the paper discusses this, but we decided to run another analysis to compare how confessions look before and after training.

508.326 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Note.

516.358 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

The results below come from a newer, somewhat improved, confession training compared to the training run analysed in our paper.

517.459 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Recall that a confession has the following structure.

524.978 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

1.

528.463 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

A list of objectives.

528.823 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

These are constraints, instructions, and properties that a good answer should have satisfied, together with an analysis of whether or not the answer did in fact satisfy these objectives.

530.766 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

2.

540.28 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Uncertainties and conflicts.

540.84 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

This should list any uncertainties, ambiguities, safety issues, or judgment calls that the AI had to deal with in the course of producing its answer.

543.124 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

To evaluate confessions, we considered a number of out-of-distribution evaluations, each of which has an evaluation-specific notion of bad behavior.

552.434 View full episode →

LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Since these behaviors are in conflict with the expected conduct of a good AI model, if an AI engaged in any of these behaviors, then an honest confession should not only mention this fact, but also admit that it led to noncompliance with an objective.

561.552 View full episode →

← Previous Page 5 of 9 Next →

Report any issue