BlueDot Narrated
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
04 Jan 2025
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars . In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.Source:https://transformer-circuits.pub/2023/monosemantic-features/index.htmlNarrated for AI Safety Fundamentals by Perrin WalkerA podcast by BlueDot Impact.Learn more on the AI Safety Fundamentals website.
No persons identified in this episode.
This episode hasn't been transcribed yet
Help us prioritize this episode for transcription by upvoting it.
Popular episodes get transcribed faster
Other recent transcribed episodes
Transcribed and ready to explore now
3ª PARTE | 17 DIC 2025 | EL PARTIDAZO DE COPE
01 Jan 1970
El Partidazo de COPE
Buchladen: Tipps für Weihnachten
20 Dec 2025
eat.READ.sleep. Bücher für dich
BOJ alza 25pb decennale sopra 2%, Oracle vola con accordo Tik Tok, 90 mld eurobond per Ucraina | Morning Finance
19 Dec 2025
Black Box - La scatola nera della finanza
365. The BEST advice for managing ADHD in your 20s ft. Chris Wang
19 Dec 2025
The Psychology of your 20s
LVST 19 de diciembre de 2025
19 Dec 2025
La Venganza Será Terrible (oficial)
Cuando la Ciencia Ficción Explicó el Mundo que Hoy Vivimos
19 Dec 2025
El Podcast de Marc Vidal