
AI Breakdown

arXiv paper - Token-Efficient Long Video Understanding for Multimodal LLMs

18 Jun 2025

Description

In this episode, we discuss Token-Efficient Long Video Understanding for Multimodal LLMs by Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon. The paper introduces STORM, a new architecture that incorporates a temporal encoder using the Mamba State Space Model to better capture temporal dynamics in video-based multimodal large language models. This approach enables effective token reduction, significantly lowering computational costs and latency while preserving essential temporal information. Experiments demonstrate that STORM achieves state-of-the-art performance on long video understanding benchmarks with substantial improvements in efficiency and accuracy.


