QUICK REVIEW

[Paper Review] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma, Linlong Lang|arXiv (Cornell University)|Mar 19, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

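One-Sentence Summary

CCL uses cross-modal context learning to strengthen joint audio-video generation, achieving high audio-video (AV) consistency with less training data and fewer resources than prior methods.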

ABSTRACT

The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
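
For readers less familiar with classifier-free guidance (CFG), which the abstract refers to, the generic single-condition form can be written as below. This is only the standard textbook formulation, not the paper's UCG variant; the symbols (noise prediction \epsilon_\theta, noisy latent x_t, condition c, null condition \varnothing, guidance scale s) are notation assumed here for illustration.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + s \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

With s = 1 this reduces to the plain conditional prediction, and larger s pushes the sample further toward the condition. According to the abstract, UCG uses the unconditional support provided by LCT in place of an ad hoc unconditional branch; the exact formulation is not given in this review.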

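Research Motivation and Objectives

Motivate improvements to joint audio-video generation built on dual-stream transformers.
Identify limitations in the gating mechanism, biases introduced by cross-modal attention, and inconsistencies between training and inference.
Propose Cross-Modal Context Learning (CCL), equipped with new modules that address these issues.
Show that CCL achieves competitive or better performance while using fewer resources.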

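Proposed Method

Keep the dual-stream transformer backbone and augment it with Temporally Aligned RoPE and Partitioning (TARP), which temporally aligns the audio and video latents.
Introduce Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) into the Cross-Modal Context Attention (CCA) module to stabilize cross-modal guidance.
Use Unconditional Context Guidance (UCG) to improve train-inference consistency and reduce conflicts among CFG terms at inference time.
Apply multi-task training over text-to-video, text-to-audio, audio-to-video, video-to-audio, and joint audio-video tasks, each with its own sampling probability.
Pre-train the audio diffusion model on external audio datasets, initialize the video diffusion stream from Wan2.1-14B, and then run joint audio-video training.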
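
The multi-task training step described above, where each task has its own sampling probability, could be sketched roughly as follows. The five task names come from the list above; the probability values, the `TASK_PROBS` table, and the `sample_task` helper are hypothetical placeholders for illustration and are not taken from the paper.

```python
import random
from collections import Counter

# Task names follow the five tasks listed in the review.
# The probabilities are HYPOTHETICAL placeholders, not values from the paper.
TASK_PROBS = {
    "text_to_video": 0.20,
    "text_to_audio": 0.20,
    "audio_to_video": 0.15,
    "video_to_audio": 0.15,
    "joint_audio_video": 0.30,
}


def sample_task(rng: random.Random) -> str:
    """Draw one training task name according to TASK_PROBS."""
    names = list(TASK_PROBS)
    weights = list(TASK_PROBS.values())
    # random.choices performs weighted sampling with replacement.
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n_steps = 10_000
    counts = Counter(sample_task(rng) for _ in range(n_steps))
    # Empirical frequencies should land close to the configured probabilities.
    for name in TASK_PROBS:
        print(f"{name}: {counts[name] / n_steps:.3f}")
```

In a real training loop, the sampled task name would decide which conditions are supplied to each stream for that step (for example, dropping the audio condition for text-to-video); how CCL wires this into DCR is not specified in this review.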

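Experimental Results

Research Questions

RQ1: Compared with gating-based approaches, does cross-modal context learning stabilize and speed up training for joint audio-video generation?
RQ2: How do LCT and DCR affect convergence, consistency, and background-foreground separation in cross-modal generation?
RQ3: Does Unconditional Context Guidance improve train-inference consistency and mitigate conflicts between text and cross-modal conditioning?
RQ4: How much does CCL improve audio quality, lip sync, and AV alignment metrics relative to recent baselines?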

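Key Findings

CCL converges faster and more stably in training than the gating-based baseline.
LCT and DCR provide background anchors and dynamic routing, improving cross-modal consistency and convergence.
UCG reduces train-inference conflicts and improves lip-sync and AV alignment metrics.
Compared with recent open-source methods (Ovi, LTX-2, MOVA), CCL achieves competitive AV consistency with substantially fewer training resources.
Ablation studies confirm that each module (TARP, LCT/DCR, UCG) contributes to the performance gains.
On a test dataset that includes easy and hard subsets, CCL shows strong lip-sync and AV alignment performance while keeping audio quality at a reasonable level.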

This review was generated by AI and checked by a human editor.