QUICK REVIEW

[Paper Review] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Zhihao He, Tieyuan Chen|arXiv (Cornell University)|2026. 01. 25.

Generative Adversarial Networks and Image Synthesis | Citations: 0

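One-Line Summary

VidLaDA introduces a bidirectional diffusion language model for video understanding, enabling parallel token prediction and stronger spatiotemporal modeling, while MARS-Cache accelerates inference by more than 12x without loss of accuracy. The model is competitive with state-of-the-art AR baselines.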

ABSTRACT

Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.

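Motivation and Objectives

Identifies the need to close the efficiency and effectiveness gap of existing Video LLMs that rely on autoregressive decoding.
Proposes a bidirectional diffusion language model to improve spatiotemporal understanding of video.
Mitigates the computational overhead of diffusion decoding through an acceleration framework designed for multimodal data.
Shows that bidirectional diffusion can match AR models on standard video reasoning benchmarks.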

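Proposed Method

Uses a Diffusion Language Model with full bidirectional attention to enable global spatiotemporal interaction between visual tokens and the text prompt.
Processes video frames into spatiotemporal visual tokens and combines them with the prompt and the partial response in a masked diffusion framework.
Trains VidLaDA with a multi-stage curriculum that progresses from short clips to long-form video, covering minute-scale video understanding.
Introduces MARS-Cache, which reduces redundant computation through frame-wise chunk attention, adaptive anchor token searching, and asynchronous cache refreshing across modalities and network depth.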
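
The masked diffusion decoding described above can be illustrated with a toy loop. This is a minimal sketch of generic confidence-based parallel unmasking, not the VidLaDA implementation: `toy_predictor` is a hypothetical stand-in for the bidirectional model's forward pass, and the equal-share unmasking schedule, `num_steps`, and `seed` are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def toy_predictor(tokens, target, rng):
    """Hypothetical stand-in for a bidirectional DLM forward pass.

    Returns a (token, confidence) guess for every masked position at once.
    A real model would derive these from attention over the visual tokens,
    the prompt, and the partial response; here the guess is simply the
    target token paired with a random confidence.
    """
    return {i: (target[i], rng.random()) for i, t in enumerate(tokens) if t == MASK}


def diffusion_decode(target, num_steps=4, seed=0):
    """Fill a fully masked response over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One "forward pass" predicts all masked positions in parallel.
        guesses = toy_predictor(tokens, target, rng)
        # Assumed schedule: commit an equal share of the remaining masked
        # positions at each step (ceiling division), most confident first.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = guesses[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    final, steps = diffusion_decode("a man opens the door".split())
    for s in steps:
        print(s)
```

Each step commits several tokens at once, in contrast to AR decoding, which commits exactly one token per forward pass.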

Figure 1: The overall architecture of VidLaDA. Input video frames ${\mathcal{V}}$ are encoded and spatially pooled (via $2\times 2$ downsampling) before being unrolled into a sequence of Spatiotemporal Visual Tokens ${{\bm{E}}^{\mathcal{V}}}$. These tokens, combined with the text prompt $P$ and th
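
The pooling and unrolling step in the caption can be sketched as follows. This is an illustrative simplification under stated assumptions: each patch feature is a single scalar rather than a vector, the pooling is assumed to be averaging (the caption only says "spatially pooled"), and the frame-by-frame, row-major ordering is an assumption. The function names `pool_2x2` and `unroll_video` are hypothetical.

```python
def pool_2x2(grid):
    """Average-pool an H x W grid of scalar features with a 2x2 window, stride 2."""
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def unroll_video(frames):
    """Pool each frame, then flatten frame by frame (row-major) into one sequence."""
    sequence = []
    for frame in frames:
        for row in pool_2x2(frame):
            sequence.extend(row)
    return sequence


if __name__ == "__main__":
    frame0 = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
    frame1 = [[0] * 4 for _ in range(4)]
    print(unroll_video([frame0, frame1]))
```

Under these assumptions, the 2x2 pooling reduces each frame's token count by a factor of four, so two 4x4 frames yield a sequence of 8 tokens instead of 32.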

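Experimental Results

Research Questions

RQ1: Can bidirectional diffusion decoding improve the spatiotemporal understanding of Video LLMs compared with AR-based models?
RQ2: Does the MARS-Cache framework deliver a substantial speedup in multimodal diffusion decoding without loss of accuracy?
RQ3: How does VidLaDA perform against state-of-the-art AR and DLM Video LLMs across benchmarks such as LongVideoBench, MLVU, and EgoSchema?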

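Key Results

VidLaDA consistently outperforms existing DLM baselines and is highly competitive with top AR Video LLMs.
MARS-Cache delivers more than 12x higher throughput than vanilla DLM decoding without loss of accuracy.
Bidirectional attention mitigates the asymmetric receptive field problem and improves global aggregation of spatiotemporal evidence.
Experiments show that VidLaDA excels on tasks that require complex spatiotemporal reasoning and long-video understanding.
CoT reasoning with MARS-Cache sustains substantial throughput gains (8-12x) across benchmarks and often exceeds AR throughput in the CoT setting.
Ablation studies indicate that anchor tokens and asynchronous refreshing are critical to the accuracy-efficiency trade-off.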
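
One way to read the asymmetric receptive field point in the results above is to count how many positions each token can attend to under a causal mask versus a bidirectional one. The sketch below is only an illustration of that general property of attention masks; the function name `visible_counts` is hypothetical and nothing here is taken from the paper's code.

```python
def visible_counts(seq_len, bidirectional):
    """Number of positions each token can attend to under the given mask."""
    if bidirectional:
        # Every token sees the whole sequence, including later frames.
        return [seq_len] * seq_len
    # Causal mask: position i sees only positions 0..i.
    return [i + 1 for i in range(seq_len)]


if __name__ == "__main__":
    print("causal:       ", visible_counts(6, bidirectional=False))
    print("bidirectional:", visible_counts(6, bidirectional=True))
```

Under the causal mask, tokens from early frames see far less context than tokens from late frames, whereas the bidirectional mask gives every token the same global view.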


This review was generated by AI and reviewed by a human editor.