QUICK REVIEW

[Paper Review] VideoComposer: Compositional Video Synthesis with Motion Controllability

Xiang Wang, Hangjie Yuan | arXiv (Cornell University) | 2023. 06. 03.

Computer Graphics and Visualization Techniques | Citations: 43

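One-Line Summary

VideoComposer is a compositional, diffusion-based video synthesis framework that conditions jointly on text, spatial signals, and temporal signals (notably motion vectors) through a Spatio-Temporal Condition encoder (STC-encoder), achieving inter-frame consistency and controllable motion.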

ABSTRACT

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer that allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. Specifically, considering the characteristic of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model could make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models will be publicly available at https://videocomposer.github.io.

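Research Motivation and Goals

Motivate controllable video synthesis by adding spatial and temporal control beyond text prompts.
Propose a three-way conditioning paradigm for video: textual, spatial, and temporal.
Introduce a motion-vector-based temporal condition that guides inter-frame dynamics.
Develop the Spatio-Temporal Condition encoder (STC-encoder) to integrate and fuse sequential conditions.
Demonstrate flexible generation under diverse sets of conditions, including hand-crafted motions.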

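Proposed Method

Adopts a video latent diffusion model (VLDM) that operates in a compressed video latent space with a pretrained encoder and decoder.
Decomposes each input video into textual, spatial, and temporal conditions that condition the denoiser.
Uses motion vectors from MPEG-4 compressed video as an explicit temporal guide.
Introduces the STC-encoder, which extracts and fuses spatio-temporal information through a lightweight spatial module and a temporal Transformer.
Fuses the STC-encoded conditions with the video latent by channel-wise concatenation, and applies cross-attention for text and style guidance.
Trains in two stages: temporal pretraining for text-to-video, followed by compositional training with diverse conditions.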
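The condition path described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, the 1x1 projection standing in for the lightweight spatial module, the single-head attention standing in for the temporal Transformer, and the choice to sum the encoded conditions before concatenation are all assumptions made for illustration.

```python
import numpy as np

def spatial_module(cond, weight):
    """Per-frame 1x1 projection (hypothetical stand-in for a lightweight spatial module).
    cond: (F, C_in, H, W), weight: (C_out, C_in) -> (F, C_out, H, W)."""
    return np.einsum("oc,fchw->fohw", weight, cond)

def temporal_self_attention(x):
    """Single-head self-attention over the frame axis, applied at each spatial location.
    x: (F, C, H, W) -> (F, C, H, W)."""
    f, c, h, w = x.shape
    tokens = x.transpose(2, 3, 0, 1).reshape(h * w, f, c)     # (HW, F, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)  # (HW, F, F)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    mixed = attn @ tokens                                     # (HW, F, C)
    return mixed.reshape(h, w, f, c).transpose(2, 3, 0, 1)

def stc_encode(cond, weight):
    """Spatial projection followed by temporal mixing (illustrative only)."""
    return temporal_self_attention(spatial_module(cond, weight))

def fuse(latent, conditions, weights):
    """Sum the encoded conditions (an assumption), then concatenate them
    with the video latent along the channel axis."""
    encoded = sum(stc_encode(c, w) for c, w in zip(conditions, weights))
    return np.concatenate([latent, encoded], axis=1)

rng = np.random.default_rng(0)
num_frames, height, width = 4, 8, 8
latent = rng.normal(size=(num_frames, 4, height, width))  # video latent
sketch = rng.normal(size=(num_frames, 1, height, width))  # spatial condition
motion = rng.normal(size=(num_frames, 2, height, width))  # (dx, dy) motion vectors
weights = [rng.normal(size=(4, 1)), rng.normal(size=(4, 2))]

fused = fuse(latent, [sketch, motion], weights)
print(fused.shape)  # (4, 8, 8, 8): 4 latent channels + 4 condition channels
```

Concatenating along channels keeps the original latent untouched in the first four channels, so a denoiser receiving `fused` sees both the noisy video and the per-frame guidance at every spatial location.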

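Experimental Results

Research Questions

RQ1: How can video synthesis be controlled by combining textual, spatial, and temporal cues in a single framework?
RQ2: Does including motion vectors as an explicit temporal signal improve inter-frame consistency and motion control?
RQ3: Does the STC-encoder effectively fuse sequential spatio-temporal conditions and improve video quality across diverse inputs?
RQ4: How do the STC-encoder and motion guidance affect frame-to-frame consistency and motion accuracy?
RQ5: How flexible is VideoComposer when handling hand-crafted motions, sketches, depth maps, and masks in video generation?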

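Key Findings

Using motion vectors as the temporal condition improves motion controllability.
Including the STC-encoder further improves frame consistency across input combinations (text plus sketch, depth, or motion vectors).
Compared with baselines without the STC-encoder, VideoComposer achieves higher frame consistency scores and lower motion control error.
VideoComposer demonstrates compositional video generation across diverse condition types such as text, sketches, depth maps, and masks while maintaining quality.
Motion vectors prioritize moving regions, enabling more flexible and precise motion control than surface-based temporal signals.
Ablation studies show that the STC-encoder contributes to both subjective fidelity and quantitative frame consistency.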
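This review does not state how the frame consistency score or the motion control error is computed. The sketch below shows one common way to quantify each, offered only as an assumption: consistency as the mean cosine similarity between embeddings of consecutive frames, and motion error as the mean Euclidean distance between reference and generated motion vectors. The function names and toy inputs are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def frame_consistency(frame_embeddings):
    """Mean cosine similarity between consecutive frame embeddings
    (assumed metric; higher means smoother appearance over time)."""
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)

def motion_error(reference, generated):
    """Mean Euclidean distance between paired (dx, dy) motion vectors
    (assumed metric; lower means the motion follows the guide more closely)."""
    dists = [
        math.hypot(rx - gx, ry - gy)
        for (rx, ry), (gx, gy) in zip(reference, generated)
    ]
    return sum(dists) / len(dists)

# Toy inputs: three 2-D frame embeddings and two motion vectors per clip.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
consistency = frame_consistency(embeddings)  # (1.0 + 0.0) / 2
error = motion_error([(1.0, 0.0), (0.0, 0.0)], [(1.0, 0.0), (3.0, 4.0)])  # (0 + 5) / 2
print(consistency, error)
```

In practice the embeddings would come from a pretrained image encoder and the motion vectors from the video codec or an optical-flow estimator; both choices are outside what the review specifies.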


This review was generated by AI and checked by a human editor.