This is an update on the work on AI Safety via Debate that we previously wrote about here. What we did: We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate.Key takeaways and relevance for alignment:Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, bayesian updating…).Original text:https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problemNarrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.---A podcast by BlueDot Impact.Learn more on the AI Safety Fundamentals website.
No persons identified in this episode.
This episode hasn't been transcribed yet
Help us prioritize this episode for transcription by upvoting it.
Popular episodes get transcribed faster
Other recent transcribed episodes
Transcribed and ready to explore now
3ª PARTE | 17 DIC 2025 | EL PARTIDAZO DE COPE
01 Jan 1970
El Partidazo de COPE
Buchladen: Tipps für Weihnachten
20 Dec 2025
eat.READ.sleep. Bücher für dich
BOJ alza 25pb decennale sopra 2%, Oracle vola con accordo Tik Tok, 90 mld eurobond per Ucraina | Morning Finance
19 Dec 2025
Black Box - La scatola nera della finanza
365. The BEST advice for managing ADHD in your 20s ft. Chris Wang
19 Dec 2025
The Psychology of your 20s
LVST 19 de diciembre de 2025
19 Dec 2025
La Venganza Será Terrible (oficial)
Cuando la Ciencia Ficción Explicó el Mundo que Hoy Vivimos
19 Dec 2025
El Podcast de Marc Vidal