
AI: post transformers

ShadowKV: High-Throughput Long-Context LLM Inference

17 Sep 2025

Description

This April 2025 paper introduces ShadowKV, an inference system for long-context Large Language Models (LLMs) designed to significantly increase throughput and support larger batch sizes without compromising accuracy. It does so by splitting management of the Key-Value (KV) cache across devices: the low-rank pre-Rotary Position Embedding (RoPE) key cache is compressed and kept on the GPU, while the value cache is offloaded to the CPU. During decoding, an accurate KV selection strategy reconstructs only a minimal set of sparse KV pairs on the fly, keeping decoding latency low. Empirical evaluations show that ShadowKV supports up to 6x larger batch sizes and boosts throughput by up to 3.04x on an A100 GPU across a range of LLMs and benchmarks, even exceeding the throughput achievable under the assumption of infinite GPU memory.

Source: https://arxiv.org/pdf/2410.21465
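A minimal sketch of how these pieces might fit together, assuming a PyTorch-style KV cache of shape [batch, heads, seq, head_dim]. The rank, chunk size, and top-k values, the per-head SVD, and the mean-pooled chunk "landmarks" are illustrative assumptions rather than the paper's exact design (the paper factors the full per-layer pre-RoPE key matrix and applies RoPE after reconstruction, which is omitted here for brevity).

import torch

# Illustrative hyperparameters, not the paper's exact settings.
RANK = 32          # low-rank dimension for the pre-RoPE key factorization
CHUNK = 64         # tokens per value chunk offloaded to CPU
TOPK_CHUNKS = 4    # sparse chunks fetched back per decode step


def build_cache(pre_rope_keys: torch.Tensor, values: torch.Tensor):
    """Prefill side: compress keys on the GPU, offload values to the CPU.

    pre_rope_keys, values: [batch, heads, seq, head_dim]; seq is assumed
    to be a multiple of CHUNK. The SVD here is per head for brevity.
    """
    b, h, s, d = pre_rope_keys.shape
    flat = pre_rope_keys.reshape(b * h, s, d)
    # Low-rank factorization K ~ (U * S) @ V^T, kept on the GPU.
    U, S, V = torch.svd_lowrank(flat, q=RANK)
    key_factors = (U * S.unsqueeze(1), V)
    # The full value cache lives on the CPU.
    cpu_values = values.cpu()
    # Per-chunk key landmarks (chunk means) used for sparse selection later.
    landmarks = pre_rope_keys.reshape(b, h, s // CHUNK, CHUNK, d).mean(dim=3)
    return key_factors, cpu_values, landmarks


def decode_step(query: torch.Tensor, key_factors, cpu_values, landmarks):
    """Decode side: pick a few chunks, rebuild only the KV pairs needed.

    query: [batch, heads, 1, head_dim].
    """
    b, h, _, d = query.shape
    # Score chunks against the query and keep the top-k.
    scores = torch.einsum("bhqd,bhcd->bhqc", query, landmarks).squeeze(2)
    top_chunks = scores.topk(TOPK_CHUNKS, dim=-1).indices   # [b, h, k]
    # Token indices covered by the selected chunks.
    offsets = torch.arange(CHUNK, device=query.device)
    idx = (top_chunks.unsqueeze(-1) * CHUNK + offsets).reshape(b, h, -1)
    # Reconstruct approximate keys from the low-rank factors (GPU only).
    U_s, V = key_factors
    keys = (U_s @ V.transpose(-1, -2)).reshape(b, h, -1, d)
    sel_keys = keys.gather(2, idx.unsqueeze(-1).expand(-1, -1, -1, d))
    # Fetch only the selected value chunks from the CPU cache.
    idx_cpu = idx.unsqueeze(-1).expand(-1, -1, -1, d).cpu()
    sel_vals = cpu_values.gather(2, idx_cpu).to(query.device)
    # Sparse attention over the reconstructed keys and fetched values.
    attn = torch.softmax(query @ sel_keys.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ sel_vals

The point of the sketch is the memory split: only the low-rank key factors and small chunk landmarks occupy GPU memory, while the bulky value cache stays on the CPU and only a handful of chunks cross the PCIe bus per decode step.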
