In 2022, it was announced that a fairly simple method could be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when trained on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to the 'real' solution in a way that does generalize. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how do they work? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions. (A minimal sketch of the belief-extraction method, CCS, appears at the end of these notes.)

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:
0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
0:00:36 - What is CCS?
0:09:54 - Consistent and contrastive features other than model beliefs
0:20:34 - Understanding the banana/shed mystery
0:41:59 - Future CCS-like approaches
0:53:29 - CCS as principal component analysis
0:56:21 - Explaining grokking through circuit efficiency
0:57:44 - Why research science of deep learning?
1:12:07 - Summary of the paper's hypothesis
1:14:05 - What are 'circuits'?
1:20:48 - The role of complexity
1:24:07 - Many kinds of circuits
1:28:10 - How circuits are learned
1:38:24 - Semi-grokking and ungrokking
1:50:53 - Generalizing the results
1:58:51 - Vikrant's research approach
2:06:36 - The DeepMind alignment team
2:09:06 - Follow-up work

The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html

Vikrant's Twitter/X account: twitter.com/vikrantvarma_

Main papers:
- Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029
- Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390

Other works discussed:
- Discovering Latent Knowledge in Language Models Without Supervision (CCS): arxiv.org/abs/2212.03827
- Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
- Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
- Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY
- Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4
- Scott Emmons, Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast
- Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton & van Camp 1993): dl.acm.org/doi/pdf/10.1145/168304.168306
- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217

Episode art by Hamish Doodles: hamishdoodles.com
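
Appendix: a minimal sketch of the CCS objective. CCS (Contrast-Consistent Search, from Burns et al., arxiv.org/abs/2212.03827) fits a small probe on a model's hidden activations over "contrast pairs" (a statement phrased as true and as false), training it only to be logically consistent and confident, with no truth labels at all. The sketch below is illustrative rather than the paper's exact code: the hidden size (4096), the linear-probe architecture, and all variable names are assumptions.

    import torch

    def ccs_loss(p_pos, p_neg):
        # p_pos / p_neg: probe outputs in (0, 1) for the "true" and "false"
        # phrasings of the same statement (one contrast pair per batch element).
        consistency = (p_pos - (1.0 - p_neg)) ** 2      # the two probabilities should sum to 1
        confidence = torch.minimum(p_pos, p_neg) ** 2   # penalize the degenerate p = 0.5 solution
        return (consistency + confidence).mean()

    # A linear probe with a sigmoid, applied to (normalized) hidden activations;
    # hidden size 4096 is an assumed placeholder for whatever model is probed.
    probe = torch.nn.Sequential(torch.nn.Linear(4096, 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def train_step(acts_pos, acts_neg):
        # acts_pos / acts_neg: [batch, 4096] activations for each half of the pairs.
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
        return loss.item()

One thing the sketch makes visible is the theme of the episode's first paper: nothing in this loss mentions truth, so any consistent, contrastive feature of the inputs (not just the model's beliefs) can minimize it.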