QUICK REVIEW

[Paper Review] AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef, Tavi Halperin|arXiv (Cornell University)|Mar 25, 2026

Speech and Audio Processing | Citations: 0

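Summary in Brief

AVControl trains a separate LoRA for each control modality on a parallel canvas attached to a frozen joint audio-visual backbone, enabling faithful structural video control with little training data and few training steps. It outperforms the evaluated baselines on depth- and pose-guided generation and reports strong results on editing and audio-visual tasks.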

ABSTRACT

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.

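Motivation and Objectives

Move video and audio generation away from a single monolithic model and toward flexible, modular control.
Introduce a lightweight framework that trains each control modality as a LoRA on a parallel canvas linked to a joint audio-visual backbone.
Achieve faithful structural video control and fine-grained inference-time control with no architectural changes beyond the LoRA adapters.
Demonstrate broad modality support, including depth, pose, Canny edges, camera trajectory, editing, and audio-visual tasks, together with data-efficient training.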

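Proposed Method

Train each control modality as an independent LoRA on the frozen LTX-2 backbone.
Place the reference signal on a parallel canvas as additional attention tokens, using per-token timesteps to distinguish reference tokens from generated tokens.
Use the standard diffusion objective and update only the LoRA adapters, which span the attention projections and feed-forward layers.
Adjust control strength globally or locally at inference time by modulating the target-reference attention interaction.
Combine multiple controls by compositing them on a single canvas.
Use control grids ranging from small to large by adjusting the reference-canvas resolution, which reduces latency for sparse controls.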
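
The mechanics listed above can be illustrated with a toy, pure-Python sketch. It is not the authors' implementation. All function names are hypothetical, and two details are assumptions made only for illustration: reference tokens are given timestep 0.0 (treated as clean), and control strength is modeled as a multiplier on the unnormalized attention weight of each reference token.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_project(x, frozen_w, lora_a, lora_b, alpha=1.0):
    """Frozen projection plus a low-rank update: W x + (alpha / r) * B (A x)."""
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * d for b, d in zip(base, delta)]


def build_parallel_canvas(target_tokens, reference_tokens, t):
    """Concatenate generated and reference tokens with per-token timesteps.

    Assumption: generated tokens carry the current diffusion timestep t,
    reference tokens carry 0.0 so the model can tell the two groups apart.
    """
    tokens = list(target_tokens) + list(reference_tokens)
    timesteps = [t] * len(target_tokens) + [0.0] * len(reference_tokens)
    return tokens, timesteps


def attend(query, target_kv, reference_kv, strength=1.0):
    """Single-query attention over generated and reference (key, value) pairs.

    strength is one number (global control) or one number per reference
    token (local control). 1.0 gives plain joint attention; 0.0 removes
    the reference. Returns (output_vector, attention_mass_on_reference).
    """
    if not target_kv:
        raise ValueError("need at least one generated token")
    if isinstance(strength, (int, float)):
        strengths = [float(strength)] * len(reference_kv)
    else:
        strengths = list(strength)
    scale = 1.0 / math.sqrt(len(query))
    pairs = list(target_kv) + list(reference_kv)
    multipliers = [1.0] * len(target_kv) + strengths
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key, _ in pairs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [m * math.exp(l - peak) for m, l in zip(multipliers, logits)]
    total = sum(weights)
    probs = [w / total for w in weights]
    dim = len(pairs[0][1])
    output = [sum(p * value[i] for p, (_, value) in zip(probs, pairs)) for i in range(dim)]
    reference_mass = sum(probs[len(target_kv):])
    return output, reference_mass


# With lora_b initialized to zeros, the adapted projection equals the frozen one.
identity = [[1.0, 0.0], [0.0, 1.0]]
query = lora_project([1.0, 0.0], identity, [[1.0, 0.0]], [[0.0], [0.0]])
target = [([1.0, 0.0], [1.0, 0.0])]      # one generated token
reference = [([1.0, 0.0], [0.0, 1.0])]   # one reference token
for s in (0.0, 1.0, 2.0):
    _, mass = attend(query, target, reference, strength=s)
    print(f"strength={s}: attention mass on reference = {mass:.3f}")
```

Scaling the reference weights inside the softmax keeps the attention distribution normalized, so lowering the strength shifts attention back to the generated tokens rather than shrinking the output.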

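Experimental Results

Research Questions

RQ1: Do modality-specific LoRAs on a parallel canvas outperform monolithic or channel-concatenation approaches across diverse audio-visual controls?
RQ2: Does separating modalities into independent LoRAs enable data- and compute-efficient training across depth, pose, camera trajectory, editing, and audio-visual tasks?
RQ3: How does parallel-canvas conditioning compare with channel-wise concatenation in preserving structural fidelity?
RQ4: How far can the framework be extended to new controls without retraining the existing modalities?
RQ5: What is the trade-off between reference strength and generation fidelity at inference time?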

Key Findings

Task	Method	AQ	BC	DD	IQ	MS	SC	Avg.
Depth	Ours	62.9	95.1	68.4	70.4	99.0	94.1	81.6
Depth	VACE	56.7	96.1	60.0	66.4	98.8	94.1	78.7
Pose	Ours	63.6	93.1	84.2	68.5	98.9	94.0	83.7
Pose	VACE	60.2	94.9	75.0	64.7	98.6	94.8	81.4
Inpainting	Ours	59.7	96.3	55.0	68.8	99.3	95.4	79.1
Inpainting	VACE	51.3	96.3	50.0	60.4	99.1	94.6	75.3
Outpainting	Ours	56.1	96.7	45.0	68.3	99.4	95.4	76.8
Outpainting	VACE	57.0	96.6	30.0	69.5	99.2	94.5	74.5
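
The Avg. column in the table above is consistent with an unweighted mean of the six metric columns. The short Python check below recomputes it from the values copied out of the table and prints the per-task margin over VACE. The names `ROWS`, `mean_score`, and `margin` are illustrative only, and the unweighted-mean reading is an inference from the numbers, not something the review states.

```python
# (task, method) -> ([AQ, BC, DD, IQ, MS, SC], reported Avg.)
ROWS = {
    ("Depth", "Ours"): ([62.9, 95.1, 68.4, 70.4, 99.0, 94.1], 81.6),
    ("Depth", "VACE"): ([56.7, 96.1, 60.0, 66.4, 98.8, 94.1], 78.7),
    ("Pose", "Ours"): ([63.6, 93.1, 84.2, 68.5, 98.9, 94.0], 83.7),
    ("Pose", "VACE"): ([60.2, 94.9, 75.0, 64.7, 98.6, 94.8], 81.4),
    ("Inpainting", "Ours"): ([59.7, 96.3, 55.0, 68.8, 99.3, 95.4], 79.1),
    ("Inpainting", "VACE"): ([51.3, 96.3, 50.0, 60.4, 99.1, 94.6], 75.3),
    ("Outpainting", "Ours"): ([56.1, 96.7, 45.0, 68.3, 99.4, 95.4], 76.8),
    ("Outpainting", "VACE"): ([57.0, 96.6, 30.0, 69.5, 99.2, 94.5], 74.5),
}


def mean_score(metrics):
    """Unweighted mean of the six metric columns."""
    return sum(metrics) / len(metrics)


def margin(task):
    """Reported Avg. of Ours minus reported Avg. of VACE for one task."""
    return ROWS[(task, "Ours")][1] - ROWS[(task, "VACE")][1]


for (task, method), (metrics, reported) in ROWS.items():
    print(f"{task:<12}{method:<6}recomputed {mean_score(metrics):.2f}  reported {reported}")

for task in ("Depth", "Pose", "Inpainting", "Outpainting"):
    print(f"{task:<12}margin over VACE: {margin(task):+.1f}")
```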

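Achieves the highest average VBench score among the evaluated baselines across depth, pose, inpainting, and outpainting. Column abbreviations follow the VBench dimensions: AQ aesthetic quality, BC background consistency, DD dynamic degree, IQ imaging quality, MS motion smoothness, SC subject consistency.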
Depth: Ours 62.9 AQ, 95.1 BC, 68.4 DD, 70.4 IQ, 99.0 MS, 94.1 SC, 81.6 Avg vs VACE 56.7, 96.1, 60.0, 66.4, 98.8, 94.1, 78.7.
Pose: Ours 63.6 AQ, 93.1 BC, 84.2 DD, 68.5 IQ, 98.9 MS, 94.0 SC, 83.7 Avg vs VACE 60.2, 94.9, 75.0, 64.7, 98.6, 94.8, 81.4.
Inpainting: Ours 59.7 AQ, 96.3 BC, 55.0 DD, 68.8 IQ, 99.3 MS, 95.4 SC, 79.1 Avg vs VACE 51.3, 96.3, 50.0, 60.4, 99.1, 94.6, 75.3.
Outpainting: Ours 56.1 AQ, 96.7 BC, 45.0 DD, 68.3 IQ, 99.4 MS, 95.4 SC, 76.8 Avg vs VACE 57.0, 96.6, 30.0, 69.5, 99.2, 94.5, 74.5.
Training efficiency: aggregate ~55K steps across 13 modalities, less than VACE’s 200K-step baseline.
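
To put the training-efficiency figures above in proportion, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch using only the approximate numbers quoted in the line above; the per-modality value is a simple mean, since the review does not give the actual split across modalities.

```python
TOTAL_STEPS = 55_000      # approximate aggregate across all modalities
NUM_MODALITIES = 13
VACE_STEPS = 200_000      # baseline budget quoted for VACE


def budget_summary(total_steps, num_modalities, baseline_steps):
    """Return (mean steps per modality, fraction of the baseline budget)."""
    return total_steps / num_modalities, total_steps / baseline_steps


mean_steps, fraction = budget_summary(TOTAL_STEPS, NUM_MODALITIES, VACE_STEPS)
print(f"mean steps per modality: {mean_steps:.0f}")
print(f"share of the VACE budget: {fraction:.1%}")
```

The mean of roughly four thousand steps per modality is in line with the abstract's description of convergence within a few hundred to a few thousand steps.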

This review was generated by AI and checked by a human editor.