
AXRP - the AI X-risk Research Podcast

40 - Jason Gross on Compact Proofs and Interpretability

28 Mar 2025

Description

How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning. (A toy sketch of the compact-proof idea follows the links below.)

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html

Topics we discuss, and timestamps:
0:00:40 - Why compact proofs
0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability
0:14:19 - What compact proofs look like
0:32:43 - Structureless noise, and why proofs
0:48:23 - What we've learned about compact proofs in general
0:59:02 - Generalizing 'symmetry'
1:11:24 - Grading mechanistic interpretability
1:43:34 - What helps compact proofs
1:51:08 - The limits of compact proofs
2:07:33 - Guaranteed safe AI, and AI for guaranteed safety
2:27:44 - Jason and Rajashree's start-up
2:34:19 - Following Jason's work

Links to Jason:
Github: https://github.com/jasongross
Website: https://jasongross.github.io
Alignment Forum: https://www.alignmentforum.org/users/jason-gross

Links to work we discuss:
Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779
Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773
Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html
Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926
Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf

Episode art by Hamish Doodles: hamishdoodles.com
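To make the compact-proof idea concrete, here is a minimal toy sketch in Python. It is an illustrative assumption of mine, not code from the paper: the actual work proves accuracy bounds for small transformers on tasks like max-of-K, while this toy swaps in a score-table model so the core contrast (brute-force verification versus a cheap structural certificate derived from understanding the model) fits in a few lines.

```python
import numpy as np

VOCAB = 10  # digits 0..9; a toy setting, not the paper's transformer setup

rng = np.random.default_rng(0)
# A toy "trained model": one learned score per token. The model answers
# max(a, b) by returning whichever input token has the higher score.
s = np.sort(rng.normal(size=VOCAB))  # pretend training produced increasing scores

def model(a: int, b: int) -> int:
    return a if s[a] >= s[b] else b

# Brute-force proof of performance: run the model on every input.
# Cost: VOCAB**2 forward passes.
def brute_force_accuracy() -> float:
    correct = sum(model(a, b) == max(a, b)
                  for a in range(VOCAB) for b in range(VOCAB))
    return correct / VOCAB**2

# Compact proof: if the score vector is strictly increasing, then the
# higher-scoring input token is always the larger one, so the model is
# correct on *every* input. Certifying this takes only VOCAB - 1 checks.
def compact_proof_holds() -> bool:
    return bool(np.all(np.diff(s) > 0))

assert compact_proof_holds()           # ~9 comparisons certify all inputs
assert brute_force_accuracy() == 1.0   # 100 forward passes agree
```

The gap between the two verification costs is roughly the kind of "compactness" metric the episode discusses: a mechanistic explanation is graded by how much it shrinks the proof of a performance bound relative to brute-force enumeration.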


