Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.Original text:https://www.deepmind.com/publications/red-teaming-language-models-with-language-modelsNarrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.---A podcast by BlueDot Impact.Learn more on the AI Safety Fundamentals website.
No persons identified in this episode.
This episode hasn't been transcribed yet
Help us prioritize this episode for transcription by upvoting it.
Popular episodes get transcribed faster
Other recent transcribed episodes
Transcribed and ready to explore now
3ª PARTE | 17 DIC 2025 | EL PARTIDAZO DE COPE
01 Jan 1970
El Partidazo de COPE
Buchladen: Tipps für Weihnachten
20 Dec 2025
eat.READ.sleep. Bücher für dich
BOJ alza 25pb decennale sopra 2%, Oracle vola con accordo Tik Tok, 90 mld eurobond per Ucraina | Morning Finance
19 Dec 2025
Black Box - La scatola nera della finanza
365. The BEST advice for managing ADHD in your 20s ft. Chris Wang
19 Dec 2025
The Psychology of your 20s
LVST 19 de diciembre de 2025
19 Dec 2025
La Venganza Será Terrible (oficial)
Cuando la Ciencia Ficción Explicó el Mundo que Hoy Vivimos
19 Dec 2025
El Podcast de Marc Vidal