QUICK REVIEW

[Paper Review] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Zhihao He, Tieyuan Chen|arXiv (Cornell University)|Jan 25, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

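One-Line Summary

VidLaDA introduces a bidirectional diffusion language model for video understanding, enabling parallel token prediction and improved spatiotemporal modeling. MARS-Cache speeds up inference by more than 12x without loss of accuracy, and the model is competitive with state-of-the-art AR baselines.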

ABSTRACT

Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.

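Motivation and Objectives

Address the efficiency and effectiveness gap of existing Video LLMs that rely on autoregressive decoding.
Propose a bidirectional diffusion language model that improves spatiotemporal video understanding.
Mitigate the computational overhead of diffusion decoding with an acceleration framework designed for multimodal data.
Show that bidirectional diffusion can rival AR models on standard video reasoning benchmarks.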

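Proposed Method

Uses a Diffusion Language Model with fully bidirectional attention to unlock global spatiotemporal interaction between visual tokens and the text prompt.
Processes frames into spatiotemporal visual tokens and combines them with the prompt and the partial response in a masked diffusion framework.
Trains VidLaDA with a progressive curriculum, from short clips to long videos, to handle minute-scale, long-duration understanding.
Introduces MARS-Cache, which cuts redundant computation through frame-wise chunk attention, adaptive anchor token search, and asynchronous cache refreshing across modalities and network depth.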
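
The frame-wise chunk attention mentioned above can be pictured as an attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the review does not describe the exact masking rule, so the function name frame_chunk_mask, the token layout (visual tokens first, then text), and the rule itself (visual tokens attend within their own frame, to a set of anchor tokens, and to all text tokens) are all hypothetical and chosen only for illustration.

```python
def frame_chunk_mask(num_frames, tokens_per_frame, num_text, anchors=()):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): visual tokens come first, frame by frame,
    followed by the text tokens.
    """
    num_visual = num_frames * tokens_per_frame
    total = num_visual + num_text
    anchor_set = set(anchors)  # indices of visual tokens visible to every query
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q >= num_visual or k >= num_visual:
                # Assumption: any pair involving a text token stays fully connected.
                mask[q][k] = True
            else:
                # Visual-to-visual: same frame only, plus the anchor tokens.
                same_frame = q // tokens_per_frame == k // tokens_per_frame
                mask[q][k] = same_frame or k in anchor_set
    return mask


# Two frames of two tokens each, one text token, and token 0 used as an anchor.
mask = frame_chunk_mask(num_frames=2, tokens_per_frame=2, num_text=1, anchors=(0,))
allowed = sum(sum(row) for row in mask)
print(allowed, "of", len(mask) ** 2, "query-key pairs are computed")
```

In this toy setting the mask keeps 19 of the 25 query-key pairs. With many frames, the visual-to-visual block shrinks from quadratic in the number of visual tokens to roughly linear, which is the kind of redundancy pruning the bullet describes.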

Figure 1: The overall architecture of VidLaDA. Input video frames $\mathcal{V}$ are encoded and spatially pooled (via $2\times 2$ downsampling) before being unrolled into a sequence of Spatiotemporal Visual Tokens $\bm{E}^{\mathcal{V}}$. These tokens, combined with the text prompt $P$ and th

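Experimental Results

Research Questions

RQ1: Can bidirectional diffusion decoding improve the spatiotemporal understanding of Video LLMs compared with autoregressive baselines?
RQ2: Does the MARS-Cache framework deliver substantial speedups in multimodal diffusion decoding without sacrificing accuracy?
RQ3: How does VidLaDA perform against state-of-the-art AR and DLM Video LLMs on diverse benchmarks (e.g., LongVideoBench, MLVU, EgoSchema)?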

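Key Findings

VidLaDA consistently outperforms existing DLM baselines and is highly competitive with top AR Video LLMs.
MARS-Cache raises throughput by more than 12x over vanilla DLM decoding with no loss of accuracy.
Bidirectional attention mitigates the asymmetric receptive field problem and strengthens the integration of global spatiotemporal evidence.
Experiments show that VidLaDA excels on tasks requiring complex spatiotemporal reasoning and long-video understanding.
With CoT reasoning, MARS-Cache maintains consistent throughput gains (8-12x) across benchmarks, and in CoT settings it often exceeds AR throughput.
Ablations show that anchor tokens and asynchronous updates are important for balancing accuracy and efficiency.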
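
The asymmetric receptive field noted in the findings can be made concrete with a toy count. This is an illustrative sketch only: visible_counts is a hypothetical helper, and it models plain causal versus fully bidirectional masks rather than anything specific to VidLaDA.

```python
def visible_counts(seq_len, causal):
    """Return how many positions each token can attend to under the mask."""
    if causal:
        # Token i sees itself and the i tokens before it.
        return [i + 1 for i in range(seq_len)]
    # Bidirectional: every token sees the whole sequence.
    return [seq_len] * seq_len


causal_view = visible_counts(6, causal=True)
bidir_view = visible_counts(6, causal=False)
print(causal_view)  # [1, 2, 3, 4, 5, 6]
print(bidir_view)   # [6, 6, 6, 6, 6, 6]
```

Under the causal mask, tokens from early frames see only a small prefix of the video, while later tokens see nearly everything. The bidirectional mask gives every token the same global view.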

This review was generated by AI and checked by a human editor.