AI Breakdown

Arxiv paper - VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

08 Mar 2025

Contributed by Lukas

In this episode, we discuss VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing by Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Y...

Arxiv paper - ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

04 Mar 2025

Contributed by Lukas

In this episode, we discuss ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models by Jonathan Roberts, Mohammad Reza Taes...

Arxiv paper - Teaching Language Models to Critique via Reinforcement Learning

03 Mar 2025

Contributed by Lukas

In this episode, we discuss Teaching Language Models to Critique via Reinforcement Learning by Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing ...

Arxiv paper - PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

27 Feb 2025

Contributed by Lukas

In this episode, we discuss PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling by Avery ...

Arxiv paper - VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

24 Feb 2025

Contributed by Lukas

In this episode, we discuss VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation by Sixiao Zheng, Zimian Peng, Yanpeng Zhou, ...

Arxiv paper - Heuristically Adaptive Diffusion-Model Evolutionary Strategy

22 Feb 2025

Contributed by Lukas

In this episode, we discuss Heuristically Adaptive Diffusion-Model Evolutionary Strategy by Benedikt Hartl, Yanbo Zhang, Hananel Hazan, Michael Levin....

Arxiv paper - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

20 Feb 2025

Contributed by Lukas

In this episode, we discuss Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by Jonas Geiping, Sean McLeish, Neel Jain, ...

Arxiv paper - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

19 Feb 2025

Contributed by Lukas

In this episode, we discuss EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents by Rui Yang,...

Arxiv paper - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

14 Feb 2025

Contributed by Lukas

In this episode, we discuss VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection by Songhao...

Arxiv paper - VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

13 Feb 2025

Contributed by Lukas

In this episode, we discuss VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models by Hila Chefer, Uriel Sin...

Arxiv paper - HunyuanVideo: A Systematic Framework For Large Video Generative Models

12 Feb 2025

Contributed by Lukas

In this episode, we discuss HunyuanVideo: A Systematic Framework For Large Video Generative Models by Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuo...

Arxiv paper - s1: Simple test-time scaling

10 Feb 2025

Contributed by Lukas

In this episode, we discuss s1: Simple test-time scaling by Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirz...

Arxiv paper - Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

07 Feb 2025

Contributed by Lukas

In this episode, we discuss Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation by The authors of the paper are ...

Arxiv paper - MatAnyone: Stable Video Matting with Consistent Memory Propagation

07 Feb 2025

Contributed by Lukas

In this episode, we discuss MatAnyone: Stable Video Matting with Consistent Memory Propagation by Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao...

Arxiv paper - Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

03 Feb 2025

Contributed by Lukas

In this episode, we discuss Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate by Yubo Wang, Xiang Yue, Wenhu Chen....

Arxiv paper - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

31 Jan 2025

Contributed by Lukas

In this episode, we discuss Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs by Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xing...

Arxiv paper - MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

30 Jan 2025

Contributed by Lukas

In this episode, we discuss MetaMorph: Multimodal Understanding and Generation via Instruction Tuning by Shengbang Tong, David Fan, Jiachen Zhu, Yunya...

Arxiv paper - Improving Video Generation with Human Feedback

29 Jan 2025

Contributed by Lukas

In this episode, we discuss Improving Video Generation with Human Feedback by Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zhen...

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

28 Jan 2025

Contributed by Lukas

In this episode, we discuss Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling by The authors of the paper are: - ...

Arxiv paper - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

27 Jan 2025

Contributed by Lukas

In this episode, we discuss DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by DeepSeek-AI. The paper introduces De...

Arxiv paper - Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

24 Jan 2025

Contributed by Lukas

In this episode, we discuss Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step by Ziyu Guo, Renrui Zhang, Cheng...

Arxiv paper - Improving Factuality with Explicit Working Memory

23 Jan 2025

Contributed by Lukas

In this episode, we discuss Improving Factuality with Explicit Working Memory by Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Ze...

Arxiv paper - Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

17 Jan 2025

Contributed by Lukas

In this episode, we discuss Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control by Zekai Gu, Rui Yan, Jiahao Lu, Peng...

Arxiv paper - FaceLift: Single Image to 3D Head with View Generation and GS-LRM

13 Jan 2025

Contributed by Lukas

In this episode, we discuss FaceLift: Single Image to 3D Head with View Generation and GS-LRM by Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu. Fac...

Arxiv paper - GenHMR: Generative Human Mesh Recovery

08 Jan 2025

Contributed by Lukas

In this episode, we discuss GenHMR: Generative Human Mesh Recovery by Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das...

Arxiv paper - Video Creation by Demonstration

06 Jan 2025

Contributed by Lukas

In this episode, we discuss Video Creation by Demonstration by Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Ad...

Arxiv paper - Byte Latent Transformer: Patches Scale Better Than Tokens

02 Jan 2025

Contributed by Lukas

In this episode, we discuss Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen,...

Arxiv paper - Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

17 Dec 2024

Contributed by Lukas

In this episode, we discuss Align3R: Aligned Monocular Depth Estimation for Dynamic Videos by Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin...

Arxiv paper - FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

17 Dec 2024

Contributed by Lukas

In this episode, we discuss FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion by Haonan Qiu, Shiwei Zhang, Yujie W...

Arxiv paper - ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

11 Dec 2024

Contributed by Lukas

In this episode, we discuss ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis by Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo...

Arxiv paper - o1-Coder: an o1 Replication for Coding

10 Dec 2024

Contributed by Lukas

In this episode, we discuss o1-Coder: an o1 Replication for Coding by Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, Jit...

Arxiv paper - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

06 Dec 2024

Contributed by Lukas

In this episode, we discuss DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning by Hao Bai, Yifei Zhou, Mert Cem...

ICLR 2025 submission - CYCLE-CONSISTENT LEARNING FOR JOINT LAYOUT-TO-IMAGE GENERATION AND OBJECT DETECTION

03 Dec 2024

Contributed by Lukas

In this episode, we discuss CYCLE-CONSISTENT LEARNING FOR JOINT LAYOUT-TO-IMAGE GENERATION AND OBJECT DETECTION by The paper's authors are listed as "...

Arxiv Paper - WonderWorld: Interactive 3D Scene Generation from a Single Image

26 Nov 2024

Contributed by Lukas

In this episode, we discuss WonderWorld: Interactive 3D Scene Generation from a Single Image by Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T....

Arxiv Paper - Hymba: A Hybrid-head Architecture for Small Language Models

22 Nov 2024

Contributed by Lukas

In this episode, we discuss Hymba: A Hybrid-head Architecture for Small Language Models by Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen...

Arxiv Paper - Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

21 Nov 2024

Contributed by Lukas

In this episode, we discuss Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation by Danny Halawi, Alexander Wei, Eric Wallace, Tony ...

Arxiv Paper - Video Instruction Tuning With Synthetic Data

20 Nov 2024

Contributed by Lukas

In this episode, we discuss Video Instruction Tuning With Synthetic Data by Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li...

Arxiv Paper - Generative Agent Simulations of 1,000 People

19 Nov 2024

Contributed by Lukas

In this episode, we discuss Generative Agent Simulations of 1,000 People by Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai...

NeurIPS 2024 - Moving Off-the-Grid: Scene-Grounded Video Representations

15 Nov 2024

Contributed by Lukas

In this episode, we discuss Moving Off-the-Grid: Scene-Grounded Video Representations by Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova,...

Arxiv Paper - Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

14 Nov 2024

Contributed by Lukas

In this episode, we discuss Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution by Peng Wang, Shuai Bai, Sinan Tan, ...

Arxiv Paper - FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

13 Nov 2024

Contributed by Lukas

In this episode, we discuss FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality by Zhengyao Lv, Chenyang Si, Junhao Song, ...

Arxiv Paper - Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

11 Nov 2024

Contributed by Lukas

In this episode, we discuss Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA by Sangmin Bae, Adam Fisch, Hrayr Harutyu...

Arxiv Paper - Long Context RAG Performance of Large Language Models

08 Nov 2024

Contributed by Lukas

In this episode, we discuss Long Context RAG Performance of Large Language Models by Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carb...

Arxiv Paper - NVLM: Open Frontier-Class Multimodal LLMs

05 Nov 2024

Contributed by Lukas

In this episode, we discuss NVLM: Open Frontier-Class Multimodal LLMs by Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tu...

Arxiv Paper - ColPali: Efficient Document Retrieval with Vision Language Models

01 Nov 2024

Contributed by Lukas

In this episode, we discuss ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani,...

Arxiv Paper - Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

31 Oct 2024

Contributed by Lukas

In this episode, we discuss Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models by Matt Deitke, Christopher Clark, Sang...

Arxiv Paper - Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

31 Oct 2024

Contributed by Lukas

In this episode, we discuss Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization by Mohammad Samragh, Iman Mi...

Arxiv Paper - Unbounded: A Generative Infinite Game of Character Life Simulation

29 Oct 2024

Contributed by Lukas

In this episode, we discuss Unbounded: A Generative Infinite Game of Character Life Simulation by Jialu Li, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Dav...

Arxiv Paper - Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer?

28 Oct 2024

Contributed by Lukas

In this episode, we discuss Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? by Nishant Balepur, Feng Gu...

Arxiv Paper - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

25 Oct 2024

Contributed by Lukas

In this episode, we discuss LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding by Xiaoqian Shen, Yunyang Xiong, Changsh...

Arxiv Paper - When Does Perceptual Alignment Benefit Vision Representations?

23 Oct 2024

Contributed by Lukas

In this episode, we discuss When Does Perceptual Alignment Benefit Vision Representations? by Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Net...

Arxiv paper - SceneCraft: Layout-Guided 3D Scene Generation

22 Oct 2024

Contributed by Lukas

In this episode, we discuss SceneCraft: Layout-Guided 3D Scene Generation by Xiuyu Yang, Yunze Man, Jun-Kun Chen, Yu-Xiong Wang. SceneCraft is a metho...

arxiv preprint - A Tale of Tails: Model Collapse as a Change of Scaling Laws

18 Oct 2024

Contributed by Lukas

In this episode, we discuss A Tale of Tails: Model Collapse as a Change of Scaling Laws by Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Ju...

arxiv preprint - Thinking LLMs: General Instruction Following with Thought Generation

17 Oct 2024

Contributed by Lukas

In this episode, we discuss Thinking LLMs: General Instruction Following with Thought Generation by Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao,...

arxiv preprint - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

16 Oct 2024

Contributed by Lukas

In this episode, we discuss Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think by Sihyun Yu, Sangkyung ...

arxiv preprint - F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

14 Oct 2024

Contributed by Lukas

In this episode, we discuss F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching by Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi...

arxiv preprint - One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

11 Oct 2024

Contributed by Lukas

In this episode, we discuss One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation by Fabian Paischer, Lukas Hauzenberger,...

arxiv preprint - Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

10 Oct 2024

Contributed by Lukas

In this episode, we discuss Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models by Seyedmorteza Sadat, Otmar Hilliges...

arxiv preprint - NEPTUNE: THE LONG ORBIT TO BENCHMARKING LONG VIDEO UNDERSTANDING

07 Oct 2024

Contributed by Lukas

In this episode, we discuss NEPTUNE: THE LONG ORBIT TO BENCHMARKING LONG VIDEO UNDERSTANDING by The authors of the paper "NEPTUNE: THE LONG ORBIT TO B...

arxiv preprint - SHIC: Shape-Image Correspondences with no Keypoint Supervision

04 Oct 2024

Contributed by Lukas

In this episode, we discuss SHIC: Shape-Image Correspondences with no Keypoint Supervision by Aleksandar Shtedritski, Christian Rupprecht, Andrea Veda...

arxiv preprint - E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

03 Oct 2024

Contributed by Lukas

In this episode, we discuss E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding by Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying...

arxiv preprint - LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

01 Oct 2024

Contributed by Lukas

In this episode, we discuss LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness by Chenming Zhu, Tai Wang, Wenwei Zhang, Jia...

arxiv preprint - DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

28 Sep 2024

Contributed by Lukas

In this episode, we discuss DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos by Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie...

arxiv preprint - Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

27 Sep 2024

Contributed by Lukas

In this episode, we discuss Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale by Fan Zhou, Zengzhi Wang, Qian Liu, Ju...

arxiv preprint - Phantom of Latent for Large Language and Vision Models

24 Sep 2024

Contributed by Lukas

In this episode, we discuss Phantom of Latent for Large Language and Vision Models by Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong...

arxiv preprint - Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

20 Sep 2024

Contributed by Lukas

In this episode, we discuss Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think by Gonzalo Martin Garcia, Karim Abou Zeid, Christi...

arxiv preprint - On the Diagram of Thought

19 Sep 2024

Contributed by Lukas

In this episode, we discuss On the Diagram of Thought by Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao. Diagram of Thought (DoT) is a framework for mode...

arxiv preprint - Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

17 Sep 2024

Contributed by Lukas

In this episode, we discuss Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources by Alisia Lupidi, Carlos Gemmell, Nicol...

arxiv preprint - SongCreator: Lyrics-based Universal Song Generation

12 Sep 2024

Contributed by Lukas

In this episode, we discuss SongCreator: Lyrics-based Universal Song Generation by Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu ...

arxiv preprint - Achieving Human Level Competitive Robot Table Tennis

11 Sep 2024

Contributed by Lukas

In this episode, we discuss Achieving Human Level Competitive Robot Table Tennis by David B. D'Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen...

arxiv preprint - Sapiens: Foundation for Human Vision Models

09 Sep 2024

Contributed by Lukas

In this episode, we discuss Sapiens: Foundation for Human Vision Models by Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin Jam...

arxiv preprint - Re-Reading Improves Reasoning in Large Language Models

06 Sep 2024

Contributed by Lukas

In this episode, we discuss Re-Reading Improves Reasoning in Large Language Models by Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong ...

arxiv preprint - SPIRE: Semantic Prompt-Driven Image Restoration

03 Sep 2024

Contributed by Lukas

In this episode, we discuss SPIRE: Semantic Prompt-Driven Image Restoration by Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanf...

arxiv preprint - Automated Design of Agentic Systems

31 Aug 2024

Contributed by Lukas

In this episode, we discuss Automated Design of Agentic Systems by Shengran Hu, Cong Lu, Jeff Clune. The paper introduces Automated Design of Agentic ...

arxiv preprint - Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

28 Aug 2024

Contributed by Lukas

In this episode, we discuss Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model by Chunting Zhou, Lili Yu, Arun Babu, Ku...

arxiv preprint - To Code, or Not To Code? Exploring Impact of Code in Pre-training

26 Aug 2024

Contributed by Lukas

In this episode, we discuss To Code, or Not To Code? Exploring Impact of Code in Pre-training by Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Moriso...

arxiv preprint - Segment Anything with Multiple Modalities

23 Aug 2024

Contributed by Lukas

In this episode, we discuss Segment Anything with Multiple Modalities by Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu. The pap...

arxiv preprint - JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

20 Aug 2024

Contributed by Lukas

In this episode, we discuss JPEG-LM: LLMs as Image Generators with Canonical Codec Representations by Xiaochuang Han, Marjan Ghazvininejad, Pang Wei K...

arxiv preprint - Mission: Impossible Language Models

19 Aug 2024

Contributed by Lukas

In this episode, we discuss Mission: Impossible Language Models by Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, Christopher Po...

arxiv preprint - Learning Task Decomposition to Assist Humans in Competitive Programming

16 Aug 2024

Contributed by Lukas

In this episode, we discuss Learning Task Decomposition to Assist Humans in Competitive Programming by Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, ...

arxiv preprint - IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

13 Aug 2024

Contributed by Lukas

In this episode, we discuss IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts by Ciara Rowles, Shimon Vainer,...

arxiv preprint - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

10 Aug 2024

Contributed by Lukas

In this episode, we discuss Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters by Charlie Snell, Jaehoon Lee,...

arxiv preprint - Language Model Can Listen While Speaking

09 Aug 2024

Contributed by Lukas

In this episode, we discuss Language Model Can Listen While Speaking by Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan ...

arxiv preprint - Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

07 Aug 2024

Contributed by Lukas

In this episode, we discuss Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning by Trapoom Ukarapol, Zhicheng Lee, Amy...

arxiv preprint - Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

06 Aug 2024

Contributed by Lukas

In this episode, we discuss Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle by Zhenyu Tang, Junwu Zhan...

arxiv preprint - Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

06 Aug 2024

Contributed by Lukas

In this episode, we discuss Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent by Shanbo Cheng, Zhichao Huang,...

arxiv preprint - Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

31 Jul 2024

Contributed by Lukas

In this episode, we discuss Graph-enhanced Large Language Models in Asynchronous Plan Reasoning by Fangru Lin, Emanuele La Malfa, Valentin Hofmann, El...

arxiv preprint - LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

30 Jul 2024

Contributed by Lukas

In this episode, we discuss LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference by Qichen Fu, Minsik Cho, Thomas Merth, Sachin Meh...

arxiv preprint - OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

29 Jul 2024

Contributed by Lukas

In this episode, we discuss OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person by Ke Sun, Jian Cao, Qi Wang, Linrui Tian,...

arxiv preprint - DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

27 Jul 2024

Contributed by Lukas

In this episode, we discuss DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM by Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenh...

arxiv preprint - Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning

23 Jul 2024

Contributed by Lukas

In this episode, we discuss Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning by Kaiwen Wang, Rahul Kidambi, R...

arxiv preprint - Chameleon: Mixed-Modal Early-Fusion Foundation Models

22 Jul 2024

Contributed by Lukas

In this episode, we discuss Chameleon: Mixed-Modal Early-Fusion Foundation Models by Chameleon Team. The paper introduces Chameleon, a family of model...

arxiv preprint - Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

18 Jul 2024

Contributed by Lukas

In this episode, we discuss Goldfish: Vision-Language Understanding of Arbitrarily Long Videos by Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, ...

arxiv preprint - Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

17 Jul 2024

Contributed by Lukas

In this episode, we discuss Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity by Santiago Pascual, Chunghsin Yeh, Ioannis Tsia...

arxiv preprint - Human-like Episodic Memory for Infinite Context LLMs

15 Jul 2024

Contributed by Lukas

In this episode, we discuss Human-like Episodic Memory for Infinite Context LLMs by Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Chri...

arxiv preprint - Learning to (Learn at Test Time): RNNs with Expressive Hidden States

12 Jul 2024

Contributed by Lukas

In this episode, we discuss Learning to (Learn at Test Time): RNNs with Expressive Hidden States by Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun V...

arxiv preprint - Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

11 Jul 2024

Contributed by Lukas

In this episode, we discuss Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions by Yu-Guan Hsieh, Cheng-Yu Hsieh,...

arxiv preprint - Evaluating Human Alignment and Model Faithfulness of LLM Rationale

09 Jul 2024

Contributed by Lukas

In this episode, we discuss Evaluating Human Alignment and Model Faithfulness of LLM Rationale by Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng. The p...

arxiv preprint - Detection and Measurement of Syntactic Templates in Generated Text

08 Jul 2024

Contributed by Lukas

In this episode, we discuss Detection and Measurement of Syntactic Templates in Generated Text by Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C...

arxiv preprint - From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

01 Jul 2024

Contributed by Lukas

In this episode, we discuss From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Zhe...
