
AI: post transformers

Elastic-Cache: Adaptive KV Caching for Diffusion LLMs

22 Oct 2025

Description

The October 16, 2025 academic paper introduces **Elastic-Cache**, a training-free strategy for significantly accelerating inference in diffusion large language models (DLMs) by optimizing Key-Value (KV) cache management. Standard DLMs decode slowly because they recompute the KV cache for all tokens at every step, even though these values change little between steps, especially in shallow layers. Elastic-Cache addresses this with an **adaptive, layer-aware refresh policy**: a lightweight **attention-aware drift test** on the most-attended token decides *when* a refresh is necessary, and a **depth-aware schedule** decides *where* to recompute, focusing only on deeper, more volatile layers. Experiments show that this approach achieves substantial throughput speedups (up to 45.1× on longer sequences) with negligible loss in accuracy compared to baseline and fixed-period caching methods. The method also incorporates **block-wise caching** for distant MASK tokens to further reduce computational overhead.

Source: https://arxiv.org/pdf/2510.14973
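
To make the description more concrete, the sketch below is a rough Python/PyTorch illustration of how such an adaptive refresh decision might look. It is not the paper's actual algorithm or API: the function names (`most_attended_index`, `needs_refresh`, `layers_to_refresh`), the relative-L2 drift metric, and the threshold `tau` are all illustrative assumptions based only on the summary above.

```python
import torch

def most_attended_index(attn_weights: torch.Tensor) -> int:
    # attn_weights: (num_heads, query_len, key_len) attention from the current
    # decoding step. Pick the key position that receives the most attention,
    # averaged over heads and queries (an assumed proxy for "most-attended token").
    return int(attn_weights.mean(dim=(0, 1)).argmax().item())

def needs_refresh(k_cached: torch.Tensor,
                  k_fresh: torch.Tensor,
                  attn_weights: torch.Tensor,
                  tau: float = 0.02) -> bool:
    # Attention-aware drift test (illustrative): recompute only the key of the
    # most-attended token and compare it against its cached version. If the
    # relative drift exceeds the assumed threshold tau, treat the cache as stale.
    idx = most_attended_index(attn_weights)
    drift = torch.norm(k_fresh[idx] - k_cached[idx]) / (torch.norm(k_cached[idx]) + 1e-8)
    return drift.item() > tau

def layers_to_refresh(trigger_layer: int, num_layers: int) -> range:
    # Depth-aware schedule (illustrative): shallow layers change little between
    # steps, so only the triggering layer and everything deeper is recomputed.
    return range(trigger_layer, num_layers)
```

In this reading, a single cheap probe per step replaces a full KV recomputation, and a refresh, when triggered, is limited to the deeper layers; block-wise caching of distant MASK tokens would sit on top of this, but its details are not covered in the summary.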

Featured in this Episode

No persons identified in this episode.

Transcription

This episode hasn't been transcribed yet


Comments

There are no comments yet.
