Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Pricing
Podcast Image

AXRP - the AI X-risk Research Podcast

33 - RLHF Problems with Scott Emmons

12 Jun 2024

Description

Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html Topics we discuss, and timestamps: 0:00:33 - Deceptive inflation 0:17:56 - Overjustification 0:32:48 - Bounded human rationality 0:50:46 - Avoiding these problems 1:14:13 - Dimensional analysis 1:23:32 - RLHF problems, in theory and practice 1:31:29 - Scott's research program 1:39:42 - Following Scott's research   Scott's website: https://www.scottemmons.com Scott's X/twitter account: https://x.com/emmons_scott When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747   Other works we discuss: AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752 Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394 Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475 The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693   Episode art by Hamish Doodles: hamishdoodles.com

Audio
Featured in this Episode

No persons identified in this episode.

Transcription

This episode hasn't been transcribed yet

Help us prioritize this episode for transcription by upvoting it.

0 upvotes
🗳️ Sign in to Upvote

Popular episodes get transcribed faster

Comments

There are no comments yet.

Please log in to write the first comment.