AI: post transformers

Strata: Efficient Hierarchical Context Caching for LLM Serving

26 Oct 2025

Description

The August 26, 2025 paper, a collaboration between Stanford, NVIDIA, Shanghai Jiao Tong University, the University of Michigan, the University of Colorado Boulder, and Carnegie Mellon University, introduces **Strata**, a hierarchical context caching framework designed to improve the performance of serving Large Language Models (LLMs) with long context windows. The core problem Strata addresses is that while caching key-value (KV) states is essential for efficiency, transferring large, fragmented cached contexts from slower memory tiers (such as CPU memory) back to the GPU creates **severe I/O bottlenecks and performance stalls**.

The episode also explains why paged attention, although designed to eliminate GPU memory fragmentation, causes data fragmentation once KV caches are offloaded: with long contexts, a request's KV pages end up scattered across the slower memory tier, so restoring them to the GPU requires many small, non-contiguous transfers. Strata overcomes these issues through two main innovations: **GPU-assisted I/O**, which mitigates data fragmentation and achieves high bandwidth utilization, and **cache-aware request scheduling**, which forms balanced batches and overlaps unavoidable I/O stalls with complementary compute. The evaluation shows that Strata significantly reduces **Time-To-First-Token (TTFT)** and increases throughput compared to state-of-the-art serving systems such as vLLM + LMCache and TensorRT-LLM on long-context benchmarks.

Source: https://arxiv.org/html/2508.18572v1
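To make the fragmentation problem concrete, here is a minimal, hypothetical PyTorch sketch (not Strata's implementation) contrasting a per-page reload of offloaded KV pages with a coalesced reload. All sizes and names are illustrative assumptions; Strata's GPU-assisted I/O performs the gather with GPU threads rather than the CPU-side staging shown here.

```python
# Hypothetical sketch: why offloaded paged KV caches cause fragmented
# CPU->GPU transfers, and how coalescing pages before the copy recovers
# bandwidth. Requires PyTorch with a CUDA device; sizes are illustrative.
import torch

PAGE_TOKENS = 16        # tokens per KV page (illustrative)
NUM_HEADS   = 8
HEAD_DIM    = 128
NUM_PAGES   = 2048      # pages offloaded to CPU memory

# Offloaded KV pages live in pinned CPU memory; a long request's pages are
# scattered across the pool rather than stored back-to-back.
cpu_pool = torch.randn(NUM_PAGES, PAGE_TOKENS, NUM_HEADS, HEAD_DIM,
                       dtype=torch.float16, pin_memory=True)
needed_pages = torch.randperm(NUM_PAGES)[:1024]   # non-contiguous page ids

gpu_dst = torch.empty(len(needed_pages), PAGE_TOKENS, NUM_HEADS, HEAD_DIM,
                      dtype=torch.float16, device="cuda")

def fragmented_load():
    # One small copy per page: each transfer is tiny, so per-transfer overhead
    # dominates and the GPU stalls waiting for the KV cache to arrive.
    for i, p in enumerate(needed_pages.tolist()):
        gpu_dst[i].copy_(cpu_pool[p], non_blocking=True)
    torch.cuda.synchronize()

def coalesced_load():
    # Gather the scattered pages into one contiguous pinned staging buffer,
    # then issue a single large host-to-device copy. (Strata instead lets the
    # GPU gather the scattered pages directly, avoiding this CPU-side step.)
    staging = cpu_pool[needed_pages].pin_memory()
    gpu_dst.copy_(staging, non_blocking=True)
    torch.cuda.synchronize()
```

In the same spirit, a hedged sketch of the cache-aware scheduling idea: mix requests whose KV is mostly cached (I/O-heavy) with requests needing fresh prefill (compute-heavy), so transfers for one group can hide behind compute for the other. The policy, threshold, and class names below are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical cache-aware batch formation (illustrative, not Strata's policy).
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int
    cached_tokens: int   # prefix tokens already cached in CPU memory

    @property
    def io_bound(self) -> bool:
        # Mostly a cache reload: dominated by CPU->GPU KV transfer time.
        return self.cached_tokens > 0.8 * self.prompt_tokens

def form_balanced_batch(queue: list[Request], budget_tokens: int) -> list[Request]:
    """Alternate I/O-bound and compute-bound requests within a token budget,
    so KV-cache loads overlap with prefill compute instead of stalling it."""
    io_reqs = sorted((r for r in queue if r.io_bound),
                     key=lambda r: r.cached_tokens, reverse=True)
    compute_reqs = sorted((r for r in queue if not r.io_bound),
                          key=lambda r: r.prompt_tokens, reverse=True)
    batch, used, progress = [], 0, True
    while progress:
        progress = False
        for pool in (io_reqs, compute_reqs):
            if pool and used + pool[0].prompt_tokens <= budget_tokens:
                req = pool.pop(0)
                batch.append(req)
                used += req.prompt_tokens
                progress = True
    return batch
```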

