
AI: post transformers

fMoE: Fine-Grained Expert Offloading for MoE Serving

13 Aug 2025

Description

This February 2025 paper introduces fMoE, a novel fine-grained expert offloading system designed to optimize the serving efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The paper highlights the memory inefficiency of current MoE-based LLMs during inference due to inactive experts residing in GPU memory, and the limitations of existing coarse-grained offloading solutions that struggle with latency-memory trade-offs. fMoE addresses these challenges by tracking iteration-level expert probability distributions through "expert maps" and leveraging input semantic embeddings to intelligently guide expert prefetching, caching, and offloading decisions. Experiments show that fMoE significantly reduces inference latency and improves expert hit rates compared to state-of-the-art methods.

Source: https://arxiv.org/html/2502.05370v1
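To make the idea concrete, below is a minimal illustrative sketch of the general mechanism the description outlines: matching a request's semantic embedding against stored "expert maps" (per-layer expert probability distributions) and using the predicted distribution to plan prefetching and offloading under a GPU memory budget. The names (`ExpertMapStore`, `plan_prefetch`), the similarity-weighted averaging, and the per-layer budget policy are assumptions for illustration, not the paper's actual API or algorithm details.

```python
# Illustrative sketch only: class/function names and the scoring scheme are
# assumptions, not fMoE's actual implementation.
import numpy as np

class ExpertMapStore:
    """Stores historical 'expert maps' (per-layer expert activation probability
    distributions), keyed by the semantic embedding of the request that produced them."""

    def __init__(self):
        self.embeddings = []   # list of (d,) input semantic embeddings
        self.expert_maps = []  # list of (num_layers, num_experts) probability arrays

    def record(self, embedding, expert_map):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.expert_maps.append(expert_map)

    def predict_map(self, embedding, top_m=8):
        """Estimate the expert distribution for a new request by averaging the
        maps of the top-m most semantically similar past requests."""
        q = embedding / np.linalg.norm(embedding)
        sims = np.array([q @ e for e in self.embeddings])
        idx = np.argsort(sims)[-top_m:]
        weights = np.exp(sims[idx])
        weights /= weights.sum()
        stacked = np.stack([self.expert_maps[i] for i in idx])
        return np.tensordot(weights, stacked, axes=1)  # (num_layers, num_experts)


def plan_prefetch(predicted_map, gpu_resident, budget_per_layer):
    """Decide, per layer, which experts to prefetch into GPU memory and which
    cached experts to offload, given a per-layer expert budget."""
    plan = {}
    for layer, probs in enumerate(predicted_map):
        want = set(np.argsort(probs)[-budget_per_layer:].tolist())  # most likely experts
        have = gpu_resident.get(layer, set())
        plan[layer] = {"prefetch": want - have, "offload": have - want}
    return plan


# Toy usage: 4 layers x 16 experts, keep 4 experts per layer on GPU.
rng = np.random.default_rng(0)
store = ExpertMapStore()
for _ in range(32):
    fake_map = rng.dirichlet(np.ones(16), size=4)      # fake historical expert maps
    store.record(rng.normal(size=128), fake_map)
pred = store.predict_map(rng.normal(size=128))
print(plan_prefetch(pred, {0: {1, 2, 3, 4}}, budget_per_layer=4))
```

In this sketch, experts absent from the `prefetch` set but already resident stay cached; the actual system's caching and offloading policy is described in the paper linked above.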
