AI: post transformers
Mechanistic interpretability: Decoding the AI's Inner Logic: Circuits and Sparse Features
15 Nov 2025
This episode draws on ten sources: excerpts from academic papers and technical reports on mechanistic interpretability and sparse autoencoders in large language models (LLMs) and vision-language models (VLMs). It surveys the state of the art in **Mechanistic Interpretability** (MI), focusing on how researchers decompose LLMs and multimodal models (MLLMs) into understandable building blocks. A central theme is the power of **Sparse Autoencoders (SAEs)**, which address polysemanticity, where a single neuron represents many unrelated concepts, by training overcomplete bases that extract sparse, **monosemantic features**. The episode details the successful scaling of SAEs to production models like Claude 3 Sonnet and Claude 3.5 Haiku, showing that the recovered features are often abstract, multilingual, and even generalize across modalities (from text to images). Listeners learn how advanced techniques like **Specialized SAEs (SSAEs)** use dense retrieval to target and interpret rare or domain-specific "dark matter" concepts, such as specialized physics knowledge or toxicity patterns, that general methods often miss. The fundamental goal is a linear representation of concepts that enables precise understanding and, crucially, manipulation of model internals.

The second half of the episode applies these features to trace computational pathways, or **circuits**, using tools like **attribution graphs** and causal interventions. We explore concrete discoveries about LLM reasoning, such as the modular circuit components (queried-rule-locating, fact-processing, and decision heads) that execute propositional logic and multi-step reasoning.
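To make the SAE idea concrete, here is a minimal sketch of a sparse autoencoder of the kind described above: a ReLU encoder into an overcomplete feature dictionary, trained against a reconstruction loss plus an L1 sparsity penalty. All dimensions, initializations, and coefficients are illustrative, not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseAutoencoder:
    """Toy sparse autoencoder: maps d_model activations into an
    overcomplete dictionary of n_features, with an L1 penalty that
    encourages sparse (ideally monosemantic) feature activations."""

    def __init__(self, d_model=8, n_features=32, l1_coeff=1e-3):
        # Overcomplete basis: n_features >> d_model
        self.W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, x):
        # ReLU encoder: non-negative feature activations
        return np.maximum(0.0, x @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruct the activation as a sum of feature directions
        return f @ self.W_dec + self.b_dec

    def loss(self, x):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = np.mean((x - x_hat) ** 2)   # reconstruction error
        sparsity = np.mean(np.abs(f))       # L1 term drives sparsity
        return recon + self.l1_coeff * sparsity

sae = SparseAutoencoder()
x = rng.normal(size=(4, 8))   # a batch of (fake) model activations
f = sae.encode(x)             # shape (4, 32), non-negative
```

In practice the dictionary is trained with gradient descent on residual-stream activations from a real model; the sketch only shows the forward pass and objective.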
We review how these mechanistic insights enable **precise control**, such as editing a model's diagnostic hypothesis (e.g., in medical scenarios) or circumventing refusal behaviors (jailbreaks) by overriding harmful-request features. We also cover cutting-edge intervention methods like **Attenuation via Posterior Probabilities (APP)**, which leverages the improved concept separation achieved by SAEs to perform highly effective, minimally disruptive concept erasure.

Sources:
1. 2025, Carnegie Mellon University: https://aclanthology.org/2025.findings-naacl.87.pdf (Specialized Sparse Autoencoders)
2. 2025, OpenAI: "Weight-sparse transformers have interpretable circuits" (implicit source; PDF attributed to an OpenAI author)
3. 2024, Anthropic: "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," published May 21, 2024 (implied source URL)
4. 2024, Anthropic: "The Claude 3 Model Family: Opus, Sonnet, Haiku" (document cited in circuit-analysis work)
5. 2024, Gemma Team: https://arxiv.org/abs/2408.00118 (Gemma 2: Improving Open Language Models at a Practical Size)
6. 2024, OpenAI: https://openai.com/index/learning-to-reason-with-llms/ (Learning to Reason with LLMs)
7. 2023, Transformer Circuits Thread: https://transformer-circuits.pub/2023/monosemantic-features/index.html (Towards Monosemanticity: Decomposing Language Models With Dictionary Learning)
8. 2022, AI Alignment Forum: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing (Causal Scrubbing)
9. 2022, Transformer Circuits Thread: https://transformer-circuits.pub/2022/solu/index.html (Softmax Linear Units)
10. 2021, Transformer Circuits Thread: https://transformer-circuits.pub/2021/framework/index.html (A Mathematical Framework for Transformer Circuits)
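The feature-level interventions discussed in the episode can be illustrated with a toy example: given a (hypothetical, fixed) decoder dictionary and a sparse code for one activation, "erasing" a concept amounts to attenuating or zeroing that feature's activation before decoding. This is a crude stand-in for posterior-weighted methods like APP, which this sketch does not implement; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a fixed dictionary of feature directions and a
# sparse code for one residual-stream activation.
d_model, n_features = 8, 32
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))  # feature -> residual direction
f = np.zeros(n_features)
f[3] = 2.0     # the "concept" feature we want to erase
f[17] = 0.5    # an unrelated active feature

def reconstruct(features, W):
    # Activation as a weighted sum of feature directions
    return features @ W

def attenuate(features, idx, scale=0.0):
    """Scale one feature's activation (scale=0.0 means full ablation).
    A simplified stand-in for posterior-weighted attenuation (APP)."""
    out = features.copy()
    out[idx] *= scale
    return out

x = reconstruct(f, W_dec)
x_erased = reconstruct(attenuate(f, 3), W_dec)
# The edit removes exactly feature 3's contribution, leaving feature 17 intact:
residual = x - x_erased   # equals 2.0 * W_dec[3]
```

Because the edit touches a single feature direction, the rest of the representation is untouched, which is the "minimally disruptive" property the episode highlights.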