QUICK REVIEW

[Paper Review] Making Video Models Adhere to User Intent with Minor Adjustments

Daniel Ajisafe, Eric Hedlin|arXiv (Cornell University)|2026. 03. 20.

Image and Video Quality Assessment | Citations: 0

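One-Line Summary

The paper shows that small, differentiable adjustments to user-provided bounding boxes, optimized to align with the attention maps of a video diffusion model, significantly improve generation quality and adherence to spatial control without retraining.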

ABSTRACT

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.

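Motivation and Goals

Improve adherence to user-specified bounding box control in text-to-video diffusion models.
Develop a differentiable bounding box editing pipeline that aligns with the model's internal attention maps.
Balance foreground control against background fidelity so that overall video quality is preserved.
Provide an optimization objective that stays close to the user input while promoting attention inside the box and preserving background attention.
Demonstrate the improvements across multiple backbones through quantitative metrics and a user study.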

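Proposed Method

Introduces differentiable attention map editing so that bounding boxes can be adjusted without discrete boundary artifacts.
Replaces non-differentiable edits with a fully differentiable mask built from smooth Gaussian and smooth edge functions.
Defines an attention alignment loss that maximizes the next layer's attention inside the edited box and includes a balancing term that preserves attention outside it.
Regularizes the edits so that they stay close to the original user-provided bounding box.
Optimizes the bounding box with gradient-based Adam updates over multiple editing steps.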
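
The steps above can be illustrated with a small, self-contained sketch. It is not the authors' implementation, and every name in it (`soft_box_mask`, `objective`, `optimize_center`, `toy_attention`, `lam`, `sharpness`) is hypothetical. It makes several simplifying assumptions: the attention map is a synthetic Gaussian blob rather than a real cross-attention map, the smooth mask uses sigmoid edges as a stand-in for the paper's Gaussian and edge functions, only the box center is optimized, the background balancing term is omitted, and plain gradient ascent with finite-difference gradients replaces Adam with automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_box_mask(cx, cy, w, h, height, width, sharpness=1.0):
    """Smooth box mask: near 1 inside the box, near 0 outside.
    Sigmoid edges (an assumption of this sketch) make it differentiable in (cx, cy)."""
    x0, x1 = cx - w / 2, cx + w / 2
    y0, y1 = cy - h / 2, cy + h / 2
    col = [sigmoid(sharpness * (x - x0)) * sigmoid(sharpness * (x1 - x)) for x in range(width)]
    row = [sigmoid(sharpness * (y - y0)) * sigmoid(sharpness * (y1 - y)) for y in range(height)]
    return [[row[y] * col[x] for x in range(width)] for y in range(height)]

def objective(center, user_center, size, attn, lam=0.01):
    """Attention mass captured by the soft box, minus a penalty for drifting from the user's box."""
    cx, cy = center
    mask = soft_box_mask(cx, cy, size[0], size[1], len(attn), len(attn[0]))
    inside = sum(m * a for mrow, arow in zip(mask, attn) for m, a in zip(mrow, arow))
    drift = (cx - user_center[0]) ** 2 + (cy - user_center[1]) ** 2
    return inside - lam * drift

def optimize_center(user_center, size, attn, steps=60, lr=4.0, eps=1e-3):
    """Gradient ascent on the box center using central finite differences
    (a stand-in for Adam with autodiff)."""
    c = list(user_center)
    for _ in range(steps):
        grad = []
        for i in range(2):
            hi, lo = c.copy(), c.copy()
            hi[i] += eps
            lo[i] -= eps
            grad.append((objective(hi, user_center, size, attn)
                         - objective(lo, user_center, size, attn)) / (2 * eps))
        c = [ci + lr * gi for ci, gi in zip(c, grad)]
    return tuple(c)

def toy_attention(height=16, width=16, peak=(10.0, 8.0), sigma=2.0):
    """Synthetic attention map: a normalized Gaussian blob centered at `peak` (x, y)."""
    raw = [[math.exp(-((x - peak[0]) ** 2 + (y - peak[1]) ** 2) / (2 * sigma ** 2))
            for x in range(width)] for y in range(height)]
    total = sum(map(sum, raw))
    return [[v / total for v in r] for r in raw]

attn = toy_attention()
user_center, size = (6.0, 8.0), (6.0, 6.0)   # user box sits left of the attention peak
new_center = optimize_center(user_center, size, attn)
print("user center:", user_center, "adjusted center:", new_center)
```

In this toy setup the box slides horizontally toward the attention peak but stops short of it, because the proximity penalty weighted by `lam` pulls it back toward the user's original placement. That is the trade-off the list above describes: a minor adjustment rather than a relocation.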

Figure 2 : Overview – We inject bounding box control for video diffusion models by editing their cross attention maps within the network. However, not all such edits are friendly to video diffusion models as they are not trained with such edits. Thus, when applying these edits, we make sure that thi

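Experimental Results

Research Questions

RQ1: Can small, differentiable adjustments to user bounding boxes improve the fidelity of bounding-box-controlled video generation?
RQ2: How can bounding box edits be made differentiable and optimized to match the cross-attention maps of a video diffusion model?
RQ3: Does optimizing attention inside the box affect background fidelity and overall generation quality?
RQ4: Do the adjusted boxes improve objective metrics and human preference on other backbones?
RQ5: What is the effect of the balancing loss that focuses on the inside of the box while maintaining background attention?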

Key Results

Model	PickScore ↑	HPSv2 ↑	mIOU ↑
Trailblazer (Ma et al., 2024b)	0.244	0.222	0.37
Our boxes + Trailblazer backbone	0.257	0.223	0.36
Our method w/o Box Opt.	0.243	0.221	0.37
Our method (full)	0.257	0.225	0.37
Peekaboo (1)	0.125	0.189	0.30
Peekaboo (2)	0.146	0.222	0.37
Freetraj (1)	0.178	0.223	0.34
Trailblazer + T2V-Turbo backbone	0.234	0.253	0.41
Our method using T2V-Turbo backbone	0.317	0.263	0.41

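The proposed differentiable box editing yields large quality gains even with relatively small changes to the bounding boxes.
Optimizing attention at the output of the next layer improves adherence to the user's intent.
Balancing attention inside and outside the box preserves background detail and prevents abnormal results.
The method outperforms baselines such as Peekaboo and Trailblazer on human preference metrics across multiple backbones.
Pairing the adjusted boxes with the Trailblazer backbone raises PickScore from 0.244 to 0.257 while HPSv2 and mIOU stay nearly unchanged (0.223 vs. 0.222 and 0.36 vs. 0.37), which suggests that the edits transfer across pipelines.
Quantitatively, PickScore, HPSv2, and mIOU are competitive with or better than the baselines; with the T2V-Turbo backbone, for example, PickScore rises from 0.234 to 0.317 at an unchanged mIOU of 0.41.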
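
For readers unfamiliar with the mIOU column, the sketch below shows one common way to compute a mean intersection-over-union between target boxes and boxes found in the generated frames. This review does not describe how the paper obtains the detected boxes or how it handles missed detections, so the helper names (`box_iou`, `mean_iou`), the `(x0, y0, x1, y1)` box format, and the choice to score a missing detection as 0 are all assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(target_boxes, detected_boxes):
    """Average IoU over frames; a frame with no detection (None) scores 0 (assumed convention)."""
    scores = [box_iou(t, d) if d is not None else 0.0
              for t, d in zip(target_boxes, detected_boxes)]
    return sum(scores) / len(scores) if scores else 0.0

# Three frames: a perfect match, a box shifted by one unit, and a missed detection.
targets = [(0, 0, 4, 4), (1, 0, 5, 4), (2, 0, 6, 4)]
detected = [(0, 0, 4, 4), (2, 0, 6, 4), None]
miou = mean_iou(targets, detected)
print(round(miou, 3))
```

The three frames score 1.0, 0.6, and 0.0, so the mean is about 0.533. A value near 0.37 or 0.41, as in the table, therefore indicates only partial overlap between requested and generated object locations on average.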

This review was generated by AI and reviewed by a human editor.