
AI: post transformers

Characterizing LLM KV Cache Workloads in Production

01 Oct 2025

Description

The June 2025 paper characterizes and optimizes the **Key-Value Cache (KV$)** workload patterns associated with serving large language models (LLMs) at a major cloud provider. Using **real-world production traces** from customer-facing (to-C) and business-facing (to-B) workloads, the authors analyze KV$ reuse behavior and find that reuse is heavily skewed, with single-turn requests mattering as much as multi-turn requests, especially in **API-dominated workloads**. Crucially, the analysis shows that **KV$ lifespan is ephemeral** and that reuse probability follows predictable exponential distributions within specific request categories. Based on these findings, the researchers propose a **workload-aware cache eviction policy** that significantly improves the cache hit ratio and reduces the time to first token compared to standard policies such as LRU and LFU. Source: https://arxiv.org/pdf/2506.02634v1
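
To make the idea of a workload-aware eviction policy concrete, here is a minimal, hypothetical Python sketch. It is not the authors' implementation; it simply illustrates how per-category exponential reuse models could drive eviction decisions. The class and parameter names (`WorkloadAwareCache`, `decay_rates`, the "chat"/"api" categories) are illustrative assumptions, not terms from the paper.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    category: str          # e.g. a to-C multi-turn chat prefix vs. a to-B API request
    last_access: float


class WorkloadAwareCache:
    """Toy eviction policy: score each cached KV$ entry by its estimated
    probability of reuse, modeled as exponential decay with a per-category
    rate, and evict the lowest-scoring entry first."""

    def __init__(self, capacity: int, decay_rates: dict):
        self.capacity = capacity
        self.decay_rates = decay_rates        # category -> decay rate (assumed, fitted offline)
        self.entries = {}                     # key -> CacheEntry

    def _reuse_score(self, entry: CacheEntry, now: float) -> float:
        lam = self.decay_rates.get(entry.category, 1.0)
        age = now - entry.last_access
        # Under the exponential-reuse assumption, the chance the entry is
        # reused again decays as exp(-lambda * age).
        return math.exp(-lam * age)

    def access(self, key: str, category: str) -> bool:
        now = time.monotonic()
        if key in self.entries:
            self.entries[key].last_access = now
            return True                       # cache hit
        if len(self.entries) >= self.capacity:
            victim = min(self.entries.values(),
                         key=lambda e: self._reuse_score(e, now))
            del self.entries[victim.key]      # evict least-likely-to-be-reused entry
        self.entries[key] = CacheEntry(key, category, now)
        return False                          # cache miss


# Example: multi-turn chat prefixes decay slowly, one-off API requests quickly.
cache = WorkloadAwareCache(capacity=2, decay_rates={"chat": 0.01, "api": 1.0})
cache.access("prefix-A", "chat")
cache.access("prefix-B", "api")
print(cache.access("prefix-A", "chat"))       # True: hit
```

The contrast with LRU/LFU is that the eviction order depends on which workload category an entry belongs to, not only on recency or frequency, which is what lets the policy exploit the category-specific reuse distributions described above.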

