AI: post transformers
Mechanistic interpretability: Decoding the AI's Inner Logic: Circuits and Sparse Features
15 Nov 2025
This episode draws on ten sources: excerpts from academic papers and technical reports on mechanistic interpretability and sparse autoencoders in large language models (LLMs) and vision-language models (VLMs). It surveys the state of the art in **Mechanistic Interpretability** (MI), focusing on how researchers decompose LLMs and multimodal models (MLLMs) into understandable building blocks. A central theme is the power of **Sparse Autoencoders (SAEs)**, which address polysemanticity, where a single neuron represents many unrelated concepts, by training overcomplete bases that extract sparse, **monosemantic features**. The episode details the successful scaling of SAEs to production models like Claude 3 Sonnet and Claude 3.5 Haiku, showing that the recovered features are often abstract, multilingual, and even generalize across modalities (from text to images). Listeners learn how advanced techniques like **Specialized SAEs (SSAEs)** use dense retrieval to target and interpret rare or domain-specific "dark matter" concepts, such as specialized physics knowledge or toxicity patterns, that general methods often miss. The fundamental goal is a linear representation of concepts that enables precise understanding and, crucially, manipulation of model internals.

The second half of the episode applies these features to trace computational pathways, or **circuits**, using tools like **attribution graphs** and causal interventions. We explore concrete discoveries about LLM reasoning, such as the modular circuit components (queried-rule-locating, fact-processing, and decision heads) that execute propositional logic and multi-step reasoning.
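To make the SAE idea concrete, here is a minimal sketch of a sparse autoencoder of the kind described above: a ReLU encoder into an overcomplete feature dictionary, trained against a reconstruction loss plus an L1 sparsity penalty. All dimensions, initializations, and coefficients are illustrative, not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseAutoencoder:
    """Toy sparse autoencoder: maps d_model activations into an
    overcomplete dictionary of n_features, with an L1 penalty that
    encourages sparse (ideally monosemantic) feature activations."""

    def __init__(self, d_model=8, n_features=32, l1_coeff=1e-3):
        # Overcomplete basis: n_features >> d_model
        self.W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, x):
        # ReLU encoder: non-negative feature activations
        return np.maximum(0.0, x @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruct the activation as a sum of feature directions
        return f @ self.W_dec + self.b_dec

    def loss(self, x):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = np.mean((x - x_hat) ** 2)   # reconstruction error
        sparsity = np.mean(np.abs(f))       # L1 term drives sparsity
        return recon + self.l1_coeff * sparsity

sae = SparseAutoencoder()
x = rng.normal(size=(4, 8))   # a batch of (fake) model activations
f = sae.encode(x)             # shape (4, 32), non-negative
```

In practice the dictionary is trained with gradient descent on residual-stream activations from a real model; the sketch only shows the forward pass and objective.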
We review how these mechanistic insights enable **precise control**, such as editing a model's diagnostic hypothesis (e.g., in medical scenarios) or circumventing refusal behaviors (jailbreaks) by overriding harmful-request features. We also cover cutting-edge intervention methods like **Attenuation via Posterior Probabilities (APP)**, which leverages the improved concept separation achieved by SAEs to perform highly effective, minimally disruptive concept erasure.

Sources:
1. 2025, Carnegie Mellon University: https://aclanthology.org/2025.findings-naacl.87.pdf (Specialized Sparse Autoencoders)
2. 2025, OpenAI: "Weight-sparse transformers have interpretable circuits" (implicit source; PDF attributed to an OpenAI author)
3. 2024, Anthropic: "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," published May 21, 2024 (implied source URL)
4. 2024, Anthropic: "The Claude 3 Model Family: Opus, Sonnet, Haiku" (document cited in circuit-analysis work)
5. 2024, Gemma Team: https://arxiv.org/abs/2408.00118 (Gemma 2: Improving Open Language Models at a Practical Size)
6. 2024, OpenAI: https://openai.com/index/learning-to-reason-with-llms/ (Learning to Reason with LLMs)
7. 2023, Transformer Circuits Thread: https://transformer-circuits.pub/2023/monosemantic-features/index.html (Towards Monosemanticity: Decomposing Language Models With Dictionary Learning)
8. 2022, AI Alignment Forum: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing (Causal Scrubbing)
9. 2022, Transformer Circuits Thread: https://transformer-circuits.pub/2022/solu/index.html (Softmax Linear Units)
10. 2021, Transformer Circuits Thread: https://transformer-circuits.pub/2021/framework/index.html (A Mathematical Framework for Transformer Circuits)
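The feature-level interventions discussed in the episode can be illustrated with a toy example: given a (hypothetical, fixed) decoder dictionary and a sparse code for one activation, "erasing" a concept amounts to attenuating or zeroing that feature's activation before decoding. This is a crude stand-in for posterior-weighted methods like APP, which this sketch does not implement; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a fixed dictionary of feature directions and a
# sparse code for one residual-stream activation.
d_model, n_features = 8, 32
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))  # feature -> residual direction
f = np.zeros(n_features)
f[3] = 2.0     # the "concept" feature we want to erase
f[17] = 0.5    # an unrelated active feature

def reconstruct(features, W):
    # Activation as a weighted sum of feature directions
    return features @ W

def attenuate(features, idx, scale=0.0):
    """Scale one feature's activation (scale=0.0 means full ablation).
    A simplified stand-in for posterior-weighted attenuation (APP)."""
    out = features.copy()
    out[idx] *= scale
    return out

x = reconstruct(f, W_dec)
x_erased = reconstruct(attenuate(f, 3), W_dec)
# The edit removes exactly feature 3's contribution, leaving feature 17 intact:
residual = x - x_erased   # equals 2.0 * W_dec[3]
```

Because the edit touches a single feature direction, the rest of the representation is untouched, which is the "minimally disruptive" property the episode highlights.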