
AI: post transformers

FlexGen: High-Throughput LLM Inference on a Single GPU

15 Sep 2025

Description

This June 2023 paper introduces FlexGen, a high-throughput generation engine designed to overcome the substantial computational and memory demands of large language model (LLM) inference on limited hardware, specifically a single commodity GPU. FlexGen aggregates memory and computation across the GPU, CPU, and disk, using an optimized block schedule and a linear programming-based policy search to decide where tensors are stored and how they are accessed. It also applies 4-bit compression to the model weights and the attention (KV) cache, significantly reducing the memory footprint with minimal accuracy loss. The authors demonstrate substantially higher throughput than existing offloading systems, enabling models as large as OPT-175B to run on a single 16GB GPU.

Source: https://arxiv.org/pdf/2303.06865
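
The 4-bit compression mentioned above can be sketched in a few lines. Below is a minimal NumPy illustration of group-wise asymmetric quantization, assuming a group size of 64; the function names are hypothetical, and a real implementation (FlexGen's included) would pack two 4-bit values per byte and operate on GPU/CPU buffers rather than plain NumPy arrays.

```python
import numpy as np

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Group-wise asymmetric 4-bit quantization (sketch).

    Splits x into contiguous groups of `group_size` elements and keeps one
    (min, scale) pair per group, mapping each value to an integer in 0..15.
    """
    flat = x.astype(np.float32).reshape(-1, group_size)
    mn = flat.min(axis=1, keepdims=True)
    mx = flat.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                    # 2**4 - 1 quantization levels
    scale = np.where(scale == 0.0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((flat - mn) / scale), 0, 15).astype(np.uint8)
    return q, mn, scale

def dequantize_4bit(q, mn, scale, shape):
    """Reconstruct an approximate float tensor before it enters compute."""
    return (q.astype(np.float32) * scale + mn).reshape(shape)

# Round-trip a mock weight matrix; the error should be small per element.
w = np.random.randn(1024, 64).astype(np.float32)
q, mn, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, mn, scale, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Keeping one (min, scale) pair per small group bounds the quantization error locally, which is why this scheme can shrink both the weights and the KV cache to roughly a quarter of their fp16 size with little accuracy loss.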
