Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html Topics we discuss, and timestamps: 0:00:37 Why introspection? 0:06:24 Experiments in "Looking Inward" 0:15:11 Why fine-tune for introspection? 0:22:32 Does "Looking Inward" test introspection, or something else? 0:34:14 Interpreting the results of "Looking Inward" 0:44:56 Limitations to introspection? 0:49:54 "Tell me about yourself", and its relation to other papers 1:05:45 Backdoor results 1:12:01 Emergent Misalignment 1:22:13 Why so hammy, and so infrequently evil? 1:36:31 Why emergent misalignment? 1:46:45 Emergent misalignment and other types of misalignment 1:53:57 Is emergent misalignment good news? 2:00:01 Follow-up work to "Emergent Misalignment" 2:03:10 Reception of "Emergent Misalignment" vs other papers 2:07:43 Evil numbers 2:12:20 Following Owain's research Links for Owain: Truthful AI: https://www.truthfulai.org Owain's website: https://owainevans.github.io/ Owain's twitter/X account: https://twitter.com/OwainEvans_UK Research we discuss: Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787 Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120 Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546 Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424 X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852 Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667 Episode art by Hamish Doodles: hamishdoodles.com
No persons identified in this episode.
This episode hasn't been transcribed yet
Help us prioritize this episode for transcription by upvoting it.
Popular episodes get transcribed faster
Other recent transcribed episodes
Transcribed and ready to explore now
3ª PARTE | 17 DIC 2025 | EL PARTIDAZO DE COPE
01 Jan 1970
El Partidazo de COPE
Buchladen: Tipps für Weihnachten
20 Dec 2025
eat.READ.sleep. Bücher für dich
LVST 19 de diciembre de 2025
19 Dec 2025
La Venganza Será Terrible (oficial)
Christmas Party, Debris & Ping-Pong
19 Dec 2025
My Therapist Ghosted Me
Episode 1320: Becoming 'The Monk': Rex Ryan on playing Gerry Hutch on stage (Part 1)
19 Dec 2025
Crime World
Friends Thru A Lens: The Holidays with Ella Risbridger
19 Dec 2025
Sentimental Garbage