AI: post transformers

Mixture-of-Depths: Dynamic Compute Allocation in Transformers

19 Nov 2025

Description

This April 4, 2024 Google DeepMind paper introduces the **Mixture-of-Depths (MoD)** transformer architecture, which improves efficiency by learning to dynamically allocate compute only to the tokens in a sequence that need it. This is achieved by setting a static capacity, C (or k), which **limits the total number of tokens** that can participate in the expensive self-attention and Multi-Layer Perceptron (MLP) computations at any given layer. This capacity limit is the key to the compute reduction: because self-attention cost scales quadratically with the number of participating tokens, halving the capacity makes the attention operation only **25% as intensive**. Beyond the compute savings, the constraint forces the network to **learn which tokens matter**, which in turn allows MoD models to match or exceed the performance of baseline transformers while using fewer FLOPs per forward pass. Crucially, MoD uses an expert-choice routing scheme with a fixed capacity to guarantee a **static computation graph**, which is vital for maintaining high hardware efficiency during training and inference, and also points toward reductions in Key-Value (KV) cache memory.

Source: https://arxiv.org/pdf/2404.02258
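
For a concrete picture of the routing step, here is a minimal PyTorch sketch of expert-choice, capacity-limited routing around a transformer block. This is an illustration under assumptions, not the paper's code: the class name `MoDBlock`, the `capacity_ratio` parameter, and the sigmoid gating of the router scores are illustrative choices; only the core idea, a per-layer linear router scores tokens, the layer keeps its top-k, and the rest bypass the block via the residual stream, comes from the description above.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """Route only the top-k tokens through a full transformer block;
    the remaining tokens bypass it on the residual stream."""

    def __init__(self, block: nn.Module, d_model: int, capacity_ratio: float = 0.5):
        super().__init__()
        self.block = block                    # self-attention + MLP sub-block (hypothetical wrapped module)
        self.router = nn.Linear(d_model, 1)   # scalar routing score per token
        self.capacity_ratio = capacity_ratio  # static capacity C as a fraction of seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        _, seq_len, d_model = x.shape
        k = max(1, int(seq_len * self.capacity_ratio))   # static capacity C (a.k.a. k)

        scores = self.router(x).squeeze(-1)              # (batch, seq_len)

        # Expert-choice routing: the *layer* picks its top-k tokens, so tensor
        # shapes, and hence the computation graph, are fixed ahead of time.
        top = torch.topk(scores, k, dim=-1)
        gather_idx = top.indices.unsqueeze(-1).expand(-1, -1, d_model)

        selected = torch.gather(x, 1, gather_idx)        # (batch, k, d_model)
        processed = self.block(selected)                 # expensive compute on k tokens only

        # Blend the processed tokens back in, scaled by the gated router score so
        # the router receives a gradient; unselected tokens pass through unchanged.
        gate = torch.sigmoid(top.values).unsqueeze(-1)   # (batch, k, 1)
        out = x.clone()
        out.scatter_(1, gather_idx, selected + gate * (processed - selected))
        return out
```

With `capacity_ratio = 0.5`, the wrapped block attends over k = seq_len / 2 tokens, so the quadratic attention term costs (1/2)² = 25% of the dense baseline, which is the figure quoted in the description; the MLP cost shrinks linearly with k.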


Transcription

This episode hasn't been transcribed yet

