QUICK REVIEW

[Paper Review] VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

Zhihao Hu, Dong Xu | arXiv (Cornell University) | 2023-07-26

Generative Adversarial Networks and Image Synthesis | Citations: 12

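One-Line Summary

VideoControlNet uses a diffusion model that combines ControlNet with motion information to translate an input video according to various prompts while maintaining content consistency. I-frames are generated with ControlNet, P-frames with MgPG, and B-frames with MgBI.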

ABSTRACT

Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we proposed a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frame) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the rest frames (i.e., the B-frame) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page.

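Research Motivation and Goals

Motivate diffusion-based video generation and address the temporal inconsistency of frame-by-frame approaches.
Use ControlNet to condition the diffusion model on the content of the input video.
Introduce motion-guided P-frame generation (MgPG) and motion-guided B-frame interpolation (MgBI) to reduce temporal redundancy.
Show that VideoControlNet inherits the capabilities of StableDiffusion while providing consistent video translation across prompts.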

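Proposed Method

Generate the I-frame with StableDiffusion, using ControlNet conditioned on a condition image derived from the input frame.
Divide the subsequent frames into GoPs (groups of pictures) and generate P-frames with MgPG, which uses optical flow for motion compensation and inpaints newly revealed areas.
Combine the residual information and the occlusion map to compute an inpainting mask that guides diffusion-based inpainting of the occluded areas.
Generate B-frames with MgBI by interpolating between the two nearest I/P-frames using motion information, fusing the warped frames according to matching scores.
Use FlowFormer for optical flow estimation and adopt ControlNet-conditioned diffusion to translate frames according to the given text prompt.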
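
The I/P/B frame layout described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names `assign_frame_types` and `bracketing_key_frames` are hypothetical, key frames are assumed to fall on every multiple of the GoP size, and the last frame is assumed to be promoted to a P-frame so that every B-frame sits between two key frames.

```python
from typing import List, Tuple


def assign_frame_types(num_frames: int, gop_size: int) -> List[str]:
    """Label each frame index as "I", "P", or "B".

    Assumptions (not taken from the paper): frame 0 is the only I-frame,
    every multiple of gop_size is a P-frame, and the final frame is also
    a P-frame so that no B-frame is left without a following key frame.
    """
    if num_frames <= 0 or gop_size <= 0:
        raise ValueError("num_frames and gop_size must be positive")
    types = []
    for idx in range(num_frames):
        if idx == 0:
            types.append("I")
        elif idx % gop_size == 0 or idx == num_frames - 1:
            types.append("P")
        else:
            types.append("B")
    return types


def bracketing_key_frames(idx: int, types: List[str]) -> Tuple[int, int]:
    """Return the nearest I/P-frame indices before and after a B-frame."""
    if types[idx] != "B":
        raise ValueError("frame %d is not a B-frame" % idx)
    prev_key = max(i for i in range(idx) if types[i] != "B")
    next_key = min(i for i in range(idx + 1, len(types)) if types[i] != "B")
    return prev_key, next_key


if __name__ == "__main__":
    layout = assign_frame_types(num_frames=8, gop_size=3)
    print(layout)
    print(bracketing_key_frames(4, layout))
```

In such a layout only the I-frame and P-frames would involve the diffusion model, while each B-frame is interpolated from the pair returned by `bracketing_key_frames`.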

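Experimental Results

Research Questions

RQ1: How can the motion information of the input video be used to maintain content consistency in diffusion-based video translation?
RQ2: Can P-frames and B-frames be generated or interpolated effectively with motion-guided strategies so as to reduce redundant regeneration and improve temporal consistency?
RQ3: How does VideoControlNet compare with existing diffusion-based video translation methods in terms of quality, consistency, and speed?
RQ4: What practical effect does conditioning the diffusion on different input conditions (e.g., canny/depth) have on output quality?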

Key Results

In the user study, VideoControlNet is preferred over Text2Video-Zero and CCPL (74.7% of votes).
On the DAVIS dataset, VideoControlNet achieves better objective metrics than Text2Video-Zero: FVD 981.99 vs 1670.39; IS 18.02 vs 13.23; FID 92.17 vs 119.01; CLIPSIM 26.14 vs 25.66; LPIPS 0.50 vs 0.56.
The optical flow error of VideoControlNet is 7.91, lower than the 17.99 of Text2Video-Zero.
VideoControlNet averages 0.30 fps versus 0.19 fps for Text2Video-Zero; with GoP = 10 and 20 diffusion steps, the average time per frame is 3.4 seconds.
MgPG (P-frame generation) and MgBI (B-frame interpolation) improve speed by reducing the need to run a full diffusion pass on every frame; MgBI is especially fast.
The method shows high content consistency, which it owes to the capabilities of the pre-trained StableDiffusion model combined with motion information and ControlNet.
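
The reported speed and metric figures can be cross-checked with simple arithmetic. The sketch below only restates numbers quoted in this review; the helper names `frames_per_second` and `relative_change` are hypothetical and introduced purely for illustration.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert an average per-frame time into throughput in fps."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


def relative_change(ours: float, baseline: float) -> float:
    """Signed change of `ours` relative to `baseline` (negative means lower)."""
    return (ours - baseline) / baseline


# Values quoted in this review: (VideoControlNet, Text2Video-Zero).
REPORTED = {
    "FVD": (981.99, 1670.39),
    "IS": (18.02, 13.23),
    "FID": (92.17, 119.01),
    "flow error": (7.91, 17.99),
    "fps": (0.30, 0.19),
}

if __name__ == "__main__":
    # 3.4 s per frame should land close to the reported 0.30 fps.
    print("fps from 3.4 s/frame: %.3f" % frames_per_second(3.4))
    for name, (ours, baseline) in REPORTED.items():
        print("%s: %+.1f%%" % (name, 100.0 * relative_change(ours, baseline)))
```

The 3.4 seconds per frame works out to roughly 0.29 fps, which agrees with the reported 0.30 fps after rounding.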

This review was generated by AI and reviewed by a human editor.