
AI Podcast

FlashInfer: A Customizable and Efficient Attention Engine for Large Language Model Inference Serving

18 Mar 2025

Description

This episode takes a deep dive into FlashInfer, an efficient and customizable attention engine designed for large language model (LLM) inference serving. FlashInfer addresses the heterogeneity of KV-cache storage with block-sparse and composable formats, optimizing memory access and reducing redundancy. It also provides customizable attention templates that adapt to a wide range of settings via just-in-time (JIT) compilation. In addition, FlashInfer's load-balanced scheduling algorithm adapts to the dynamism of user requests while remaining compatible with CUDAGraph.
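To make the "block-sparse KV-cache storage" idea concrete, here is a minimal sketch of a paged KV cache in plain Python. This is an illustration of the general paging concept only, not FlashInfer's actual API: the class and method names (`PagedKVCache`, `append_token`, `release`) are hypothetical. Each request's KV history is stored as a list of fixed-size pages drawn from a shared pool, so requests of very different lengths can share one buffer without per-request over-allocation.

```python
# Hypothetical illustration of a paged (block-sparse) KV cache.
# NOT FlashInfer's API -- names and structure are invented for this sketch.

PAGE_SIZE = 4  # tokens per page (kept small for illustration)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # shared pool of page indices
        self.page_table = {}  # request id -> list of page indices it owns
        self.lengths = {}     # request id -> number of cached tokens

    def append_token(self, req_id):
        """Record one more cached token, allocating a new page at each page boundary."""
        pages = self.page_table.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % PAGE_SIZE == 0:  # current page is full (or request has none yet)
            pages.append(self.free_pages.pop())
        self.lengths[req_id] = length + 1

    def release(self, req_id):
        """Return a finished request's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

With `PAGE_SIZE = 4`, a request that has cached 6 tokens owns 2 pages while a 3-token request owns 1; when a request finishes, its pages immediately become available to others, which is what keeps memory utilization high under heterogeneous request lengths.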
