QUICK REVIEW

[Paper Review] Making Video Models Adhere to User Intent with Minor Adjustments

Daniel Ajisafe, Eric Hedlin|arXiv (Cornell University)|Mar 20, 2026

Image and Video Quality Assessment | Cited by 0

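One-Sentence Summary

The paper shows that small, differentiable adjustments to user-provided bounding boxes, optimized to align with the attention maps of a video diffusion model, improve both generation quality and adherence to spatial control, without retraining the model.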

ABSTRACT

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.

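Research Motivation and Goals

Improve adherence to user-specified bounding-box control in text-to-video diffusion models.
Develop a differentiable bounding-box editing pipeline that aligns with the model's internal attention maps.
Balance foreground control against background fidelity to maintain overall video quality.
Provide an optimization objective that encourages attention inside the box while preserving background attention and staying close to the user input.
Demonstrate the improvements with quantitative metrics on multiple backbones and with a user study.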

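Proposed Method

Introduces differentiable editing of attention maps to avoid discrete-boundary artifacts when bounding boxes are adjusted.
Replaces non-differentiable edits with a fully differentiable mask built from a smooth Gaussian function and a smooth edge function.
Defines an attention-aligned loss that maximizes next-layer attention inside the edited box and includes a balancing term to preserve attention outside the box.
Regularizes the edits so that the box stays as close as possible to the original user-provided bounding box.
Optimizes the bounding box with Adam gradient updates over multiple editing steps.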
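
The steps above can be illustrated with a toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the mask uses sigmoid edges as a stand-in for the paper's Gaussian and edge functions (which this summary does not specify), the attention map is a fixed synthetic blob rather than the next-layer attention of a video diffusion model, the loss form and the weights `bg_weight` and `reg_weight` are hypothetical, and central finite differences replace the automatic differentiation a real pipeline would use. Keeping the best box seen is a convenience added here, not something the source describes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_box_mask(box, height, width, sharpness=2.0):
    """Smooth box mask with values in (0, 1); box = (cx, cy, w, h) in pixels."""
    cx, cy, w, h = box
    # Each axis profile is a product of two sigmoid edges, so the mask
    # changes smoothly when the box position or size changes.
    col = [sigmoid(sharpness * (x - cx + w / 2)) * sigmoid(sharpness * (cx + w / 2 - x))
           for x in range(width)]
    row = [sigmoid(sharpness * (y - cy + h / 2)) * sigmoid(sharpness * (cy + h / 2 - y))
           for y in range(height)]
    return [[r * c for c in col] for r in row]

def loss(box, attn, user_box, bg_weight=0.5, reg_weight=0.05):
    """Hypothetical objective: reward in-box attention, keep some attention
    outside the box, and stay close to the user-provided box."""
    mask = soft_box_mask(box, len(attn), len(attn[0]))
    total = sum(sum(r) for r in attn)
    inside = sum(m * a for mr, ar in zip(mask, attn) for m, a in zip(mr, ar)) / total
    outside = 1.0 - inside
    reg = sum((b - u) ** 2 for b, u in zip(box, user_box))
    tiny = 1e-8  # guards the logarithms
    return (-math.log(inside + tiny)
            - bg_weight * math.log(outside + tiny)
            + reg_weight * reg)

def numeric_grad(f, box, h=1e-4):
    """Central finite differences; a stand-in for autograd."""
    grad = []
    for i in range(len(box)):
        up, dn = list(box), list(box)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def optimize_box(attn, user_box, steps=100, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam updates on the four box parameters; returns the best box seen."""
    f = lambda b: loss(b, attn, user_box)
    box = list(user_box)
    m, v = [0.0] * 4, [0.0] * 4
    best_box, best_loss = list(box), f(box)
    for t in range(1, steps + 1):
        g = numeric_grad(f, box)
        for i in range(4):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            box[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        cur = f(box)
        if cur < best_loss:
            best_box, best_loss = list(box), cur
    return best_box, best_loss

def toy_attention(height=16, width=16, cx=10.0, cy=8.0, sigma=2.0):
    """Synthetic attention map: one Gaussian blob (illustrative only)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

attn = toy_attention()
user_box = (7.0, 8.0, 6.0, 6.0)  # user box sits left of the attention blob
initial_loss = loss(user_box, attn, user_box)
opt_box, final_loss = optimize_box(attn, user_box)
print("user box:     ", user_box)
print("optimized box:", [round(b, 2) for b in opt_box])
```

In this toy setting the box drifts toward the region where attention is concentrated, while the regularizer keeps it from moving far from the user's input. In an actual pipeline the attention map would have to be recomputed from the model after each edit.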

Figure 2 : Overview – We inject bounding box control for video diffusion models by editing their cross attention maps within the network. However, not all such edits are friendly to video diffusion models as they are not trained with such edits. Thus, when applying these edits, we make sure that thi

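Experimental Results

Research Questions

RQ1: Can small, differentiable adjustments to user bounding boxes improve the fidelity of box-controlled video generation?
RQ2: How can bounding-box edits be made differentiable and optimized to align with the cross-attention maps of a video diffusion model?
RQ3: Does optimizing attention inside the box affect background fidelity and overall generation quality?
RQ4: Do the adjusted bounding boxes improve objective metrics and human preference across different backbones?
RQ5: What effect does the balancing loss have on keeping background attention while focusing attention inside the box?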

Key Findings

Model	PickScore ↑	HPSv2 ↑	mIOU ↑
Trailblazer Ma et al. (2024b)	0.244	0.222	0.37
Our boxes + Trailblazer backbone	0.257	0.223	0.36
Our method w/o Box Opt.	0.243	0.221	0.37
Our method (full)	0.257	0.225	0.37
Peekaboo (1)	0.125	0.189	0.30
Peekaboo (2)	0.146	0.222	0.37
Freetraj (1)	0.178	0.223	0.34
Trailblazer + T2V-Turbo backbone	0.234	0.253	0.41
Our method using T2V-Turbo backbone	0.317	0.263	0.41

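The proposed differentiable box editing yields clear quality gains while changing the bounding boxes only modestly.
Optimizing next-layer attention inside the box helps improve adherence to user intent.
Balancing attention inside and outside the box helps preserve background detail and avoids degenerate results.
Across multiple backbones, the method outperforms baselines such as Peekaboo and Trailblazer on human-preference metrics.
Using the adjusted boxes with the Trailblazer backbone raises PickScore (0.244 to 0.257) and HPSv2 (0.222 to 0.223), which suggests the edits transfer, although mIOU is slightly lower (0.37 to 0.36).
Quantitative results on PickScore, HPSv2, and mIOU are competitive with or better than the baselines.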
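
As a reading aid for the mIOU column, here is a minimal sketch of box intersection-over-union averaged over pairs. It assumes that mIOU compares each requested box with a box detected around the generated object; this summary does not describe the paper's exact evaluation protocol, and the function names and the corner-based box format are choices made here for illustration.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(target_boxes, detected_boxes):
    """Average IoU over paired target and detected boxes."""
    pairs = list(zip(target_boxes, detected_boxes))
    if not pairs:
        return 0.0
    return sum(box_iou(t, d) for t, d in pairs) / len(pairs)

# Two frames: a perfect match and a partial overlap (1 / 7).
targets = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
detections = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
print(round(mean_iou(targets, detections), 4))  # 0.5714
```

A higher value means the generated object lands closer to the requested box, so rows of the table with equal mIOU follow the layout equally well under this reading.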


This review was generated by AI and checked by a human editor.