AI: post transformers

LithOS: Operating System for Efficient GPU Machine Learning

26 Oct 2025

Description

This 2025 CMU paper introduces **LithOS**, a novel operating system designed to improve the efficiency and utilization of Graphics Processing Units (GPUs) for machine learning (ML) workloads in data centers. The authors argue that current GPU management solutions, such as NVIDIA's MPS and MIG, are too coarse-grained, leading to low utilization and high latency in multi-tenant environments. LithOS proposes a transparent, OS-level approach featuring a **TPC Scheduler** for fine-grained resource control, a **Kernel Atomizer** that breaks up monolithic kernels to reduce head-of-line blocking, and mechanisms for **hardware right-sizing** and **transparent power management** (DVFS). Evaluation results demonstrate that LithOS significantly reduces tail latencies (up to 13× compared to MPS) and improves aggregate throughput in both inference-only and hybrid inference/training scenarios while achieving substantial capacity and energy savings. Overall, the work establishes a foundation for developing true operating systems for GPUs to address the growing efficiency crisis in ML infrastructure.

Source: https://www.cs.cmu.edu/~dskarlat/publications/lithos_sosp25.pdf
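The kernel-atomization idea can be pictured with a small, hypothetical CUDA sketch (not taken from the paper): instead of one monolithic launch covering all of the work, the work is issued as a sequence of short slice launches, giving a GPU-level scheduler natural points to interleave another tenant's latency-sensitive kernels rather than waiting behind a single long-running grid. The slice size, the `saxpy_slice` kernel, and the scheduling comment are illustrative assumptions, not LithOS's actual mechanism.

```cuda
// Illustrative sketch only: splitting one large launch into small slices
// so other work can be interleaved between slices (reduced head-of-line blocking).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy_slice(float a, const float* x, float* y,
                            size_t offset, size_t n_total) {
    size_t i = offset + blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n_total) y[i] = a * x[i] + y[i];
}

int main() {
    const size_t N = 1 << 24;      // total elements in the "monolithic" job
    const size_t SLICE = 1 << 20;  // elements per atomized slice (assumed value)
    const int BLOCK = 256;

    float *x, *y;
    cudaMalloc(&x, N * sizeof(float));
    cudaMalloc(&y, N * sizeof(float));

    // Launch the job as many short slices instead of one huge grid.
    for (size_t off = 0; off < N; off += SLICE) {
        size_t n = (off + SLICE < N) ? SLICE : (N - off);
        int grid = (int)((n + BLOCK - 1) / BLOCK);
        saxpy_slice<<<grid, BLOCK>>>(2.0f, x, y, off, N);
        // A GPU OS/scheduler could insert a higher-priority kernel here,
        // instead of it queuing behind one monolithic launch.
    }
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    printf("done\n");
    return 0;
}
```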


