Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html Topics we discuss, and timestamps: 0:00:33 - Deceptive inflation 0:17:56 - Overjustification 0:32:48 - Bounded human rationality 0:50:46 - Avoiding these problems 1:14:13 - Dimensional analysis 1:23:32 - RLHF problems, in theory and practice 1:31:29 - Scott's research program 1:39:42 - Following Scott's research Scott's website: https://www.scottemmons.com Scott's X/twitter account: https://x.com/emmons_scott When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747 Other works we discuss: AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752 Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394 Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475 The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693 Episode art by Hamish Doodles: hamishdoodles.com
No persons identified in this episode.
This episode hasn't been transcribed yet
Help us prioritize this episode for transcription by upvoting it.
Popular episodes get transcribed faster
Other recent transcribed episodes
Transcribed and ready to explore now
3ª PARTE | 17 DIC 2025 | EL PARTIDAZO DE COPE
01 Jan 1970
El Partidazo de COPE
LVST 19 de diciembre de 2025
19 Dec 2025
La Venganza Será Terrible (oficial)
Trumps irre Milliarden-Fusion und Win-Win-Deal für Netflix
19 Dec 2025
Alles auf Aktien – Die täglichen Finanzen-News
PL Striker Transfer Grades
18 Dec 2025
ESPN FC
TNB Tech Minute: FTC Orders Instacart to Pay $60 Million Over Deceptive Practices
18 Dec 2025
WSJ Tech News Briefing
Hidden Gem Stocks We Love at the End of the Year
18 Dec 2025
Motley Fool Money