arxiv preprint - LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models - AI Breakdown | Transcription & Insights

Description

In this episode, we discuss "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models" by Yanwei Li, Chengyao Wang, and Jiaya Jia. The paper introduces LLaMA-VID, an approach that helps Vision Language Models (VLMs) process long videos by representing each frame with two tokens: a context token and a content token. The context token captures the overall image context, while the content token captures the specific visual details of each frame. Representing a frame with so few tokens eases the computational strain of handling extended video content. LLaMA-VID enhances VLM capabilities for long-duration video understanding and outperforms existing methods on various video and image benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
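To make the computational argument concrete, here is a rough back-of-the-envelope sketch of how many visual tokens a language model would have to read for one long video. Only the two-tokens-per-frame figure comes from the description above. The one-hour duration, the sampling rate of one frame per second, the 256-token per-frame baseline, and the helper name `tokens_for_video` are assumptions chosen purely for illustration, not values taken from the paper or its code.

```python
def tokens_for_video(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Total visual tokens the language model must read for one video."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


# Hypothetical settings, chosen only for illustration.
DURATION_S = 60 * 60     # a one-hour video
FPS = 1.0                # assumed sampling rate: one frame per second
BASELINE_TOKENS = 256    # assumed per-frame cost of a patch-token encoder

baseline = tokens_for_video(DURATION_S, FPS, BASELINE_TOKENS)
dual = tokens_for_video(DURATION_S, FPS, 2)  # one context token + one content token

print(f"baseline: {baseline} tokens")
print(f"two tokens per frame: {dual} tokens")
print(f"reduction factor: {baseline // dual}x")
```

Under these assumed numbers the baseline needs 921,600 visual tokens, while two tokens per frame needs 7,200, a 128-fold reduction. The exact ratio depends entirely on the baseline per-frame count you assume.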
