QUICK REVIEW

[Paper Review] Making Video Models Adhere to User Intent with Minor Adjustments

Daniel Ajisafe, Eric Hedlin|arXiv (Cornell University)|Mar 20, 2026

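Image and Video Quality Assessment | Citations: 0

One-line summary

Small, differentiable adjustments to user-provided bounding boxes are optimized to align with the attention maps of a video diffusion model, improving both generation quality and spatial control without retraining.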

ABSTRACT

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.

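Motivation and objectives

Improve adherence to user-specified bounding-box control in text-to-video diffusion models.
Develop a differentiable bounding-box editing pipeline that aligns boxes with the model's internal attention maps.
Balance foreground control against background fidelity so that overall video quality is preserved.
Provide an optimization objective that encourages attention inside the box, preserves background attention, and keeps the box close to the user input.
Demonstrate the improvements with quantitative metrics and a user study across multiple backbones.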

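Proposed method

Introduce a differentiable attention-map edit, free of the artifacts caused by hard box borders, so that the bounding boxes can be adjusted.
Replace the non-differentiable edit with a fully differentiable mask built from a smooth Gaussian function and a smooth edge function.
Define an attention-alignment loss that maximizes next-layer attention inside the edited box, with a balancing term that preserves attention outside it.
Regularize the edit so the box stays close to the original user-provided bounding box.
Optimize the bounding box with gradient-based Adam updates over multiple edit steps.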
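
The steps above can be illustrated with a toy sketch. It is not the authors' implementation, and it rests on several assumptions: the attention map is a fixed synthetic blob (in the paper it comes from the next layer of the video diffusion model after the edit), the smooth mask uses sigmoid edges only (no Gaussian component), only the box center is optimized, the background-balancing term is omitted, and central finite differences stand in for automatic differentiation. All function names and the values of `tau`, `lam`, `lr`, and `steps` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_box_mask(cx, cy, w, h, size, tau=1.0):
    """Smooth stand-in for a hard box mask: product of four sigmoid edges.
    Assumption: the paper also uses a Gaussian component, omitted here."""
    x0, x1 = cx - w / 2, cx + w / 2
    y0, y1 = cy - h / 2, cy + h / 2
    return [[sigmoid((x - x0) / tau) * sigmoid((x1 - x) / tau)
             * sigmoid((y - y0) / tau) * sigmoid((y1 - y) / tau)
             for x in range(size)] for y in range(size)]

def toy_attention(size, peak_x, peak_y, sigma=2.0):
    """Synthetic attention map: a single Gaussian blob (illustrative only)."""
    return [[math.exp(-((x - peak_x) ** 2 + (y - peak_y) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def loss(center, user_center, attn, w, h, lam=0.01):
    """Negative fraction of attention inside the soft box, plus a penalty
    that keeps the box center near the user-provided one."""
    mask = soft_box_mask(center[0], center[1], w, h, len(attn))
    inside = sum(m * a for mr, ar in zip(mask, attn) for m, a in zip(mr, ar))
    total = sum(a for row in attn for a in row)
    reg = sum((c - u) ** 2 for c, u in zip(center, user_center))
    return -inside / total + lam * reg

def numeric_grad(f, p, eps=1e-4):
    """Central finite differences; a stand-in for autodiff in this sketch."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def optimize_box(user_center, attn, w, h, steps=200, lr=0.05):
    """Adam updates on the box center, starting from the user's box."""
    p = list(user_center)
    m, v = [0.0] * len(p), [0.0] * len(p)
    b1, b2, eps = 0.9, 0.999, 1e-8
    f = lambda q: loss(q, user_center, attn, w, h)
    for t in range(1, steps + 1):
        g = numeric_grad(f, p)
        for i in range(len(p)):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

if __name__ == "__main__":
    attn = toy_attention(16, peak_x=9.0, peak_y=8.0)
    user = [6.0, 8.0]  # user box center, offset from the attention peak
    adjusted = optimize_box(user, attn, w=6.0, h=6.0)
    print("user center:", user, "adjusted center:", adjusted)
```

In this toy setting the regularizer weight `lam` controls how far the box may drift from the user's input: a larger value keeps the adjustment minor, a smaller one lets the box follow the attention peak more closely.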

Figure 2 : Overview – We inject bounding box control for video diffusion models by editing their cross attention maps within the network. However, not all such edits are friendly to video diffusion models as they are not trained with such edits. Thus, when applying these edits, we make sure that thi

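Experimental results

Research questions

RQ1: Can small, differentiable adjustments to user bounding boxes improve the fidelity of box-controlled video generation?
RQ2: How can box edits be made differentiable and optimized to align with the cross-attention maps of a video diffusion model?
RQ3: Does optimizing attention inside the box affect background fidelity and overall generation quality?
RQ4: Do the adjusted boxes improve objective metrics and human preference across different backbones?
RQ5: What is the effect of a balanced loss that focuses attention inside the box while preserving background attention?

Key findings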

Model	PickScore ↑	HPSv2 ↑	mIOU ↑
Trailblazer Ma et al. (2024b)	0.244	0.222	0.37
Our boxes + Trailblazer backbone	0.257	0.223	0.36
Our method w/o Box Opt.	0.243	0.221	0.37
Our method (full)	0.257	0.225	0.37
Peekaboo (1)	0.125	0.189	0.30
Peekaboo (2)	0.146	0.222	0.37
Freetraj (1)	0.178	0.223	0.34
Trailblazer + T2V-Turbo backbone	0.234	0.253	0.41
Our method using T2V-Turbo backbone	0.317	0.263	0.41

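The proposed differentiable box edit yields large quality gains even when the change to the box is modest.
Optimizing attention within the next layer's output improves adherence to user intent.
Balancing attention inside and outside the box preserves background detail and prevents degenerate results.
The method outperforms baselines such as Peekaboo and Trailblazer on human-preference metrics across multiple backbones (PickScore 0.257 vs. 0.244 for Trailblazer, and 0.317 vs. 0.234 with the T2V-Turbo backbone).
Feeding the adjusted boxes to the Trailblazer backbone raises its PickScore from 0.244 to 0.257, which suggests the edits transfer across methods.
Quantitatively, PickScore, HPSv2, and mIOU are competitive with or better than the baselines; mIOU itself is essentially unchanged (0.37 for both Trailblazer and the full method).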
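
For readers unfamiliar with the mIOU column, the following minimal sketch shows the standard intersection-over-union computation between two axis-aligned boxes and its mean over a set of pairs. How the paper obtains the predicted boxes (for example, which detector or which frames are used) is not stated in this review, so the `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, target_boxes):
    """Average IoU over paired predicted and target boxes."""
    pairs = list(zip(pred_boxes, target_boxes))
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs)
```

A higher value means the generated object sits closer to the requested box.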

This review was generated by AI and checked by a human editor.