QUICK REVIEW

[Paper Review] MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Vikram Voleti, Alexia Jolicoeur-Martineau | arXiv (Cornell University) | 2022-05-19

Generative Adversarial Networks and Image Synthesis | Citations: 46

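One-Line Summary

MCVD uses a masked conditional diffusion framework trained on blocks of past and future frames to unify video prediction, unconditional generation, and interpolation, and achieves state-of-the-art results with efficient block-wise autoregressive generation.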

ABSTRACT

Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io ; Code : https://github.com/voletiv/mcvd-pytorch
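The mapping from mask configuration to task described in the abstract can be written as a small lookup. This is only an illustrative sketch: the function name `task_for_masks` and its boolean arguments are hypothetical and do not come from the MCVD code base.

```python
def task_for_masks(past_masked: bool, future_masked: bool) -> str:
    """Return the task implied by which conditioning blocks are masked.

    Hypothetical helper that restates the abstract; not part of mcvd-pytorch.
    """
    if past_masked and future_masked:
        # Nothing is left to condition on.
        return "unconditional generation"
    if future_masked:
        # Only past frames are visible, so the model predicts forward in time.
        return "future prediction"
    if past_masked:
        # Only future frames are visible, so the model predicts backward in time.
        return "past prediction"
    # Both sides are visible, so the model fills in the frames between them.
    return "interpolation"


for past_masked in (False, True):
    for future_masked in (False, True):
        print(past_masked, future_masked, task_for_masks(past_masked, future_masked))
```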

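Motivation and Objectives

Motivated by the difficulty of high-quality, generalizable video synthesis across tasks (prediction, generation, and interpolation).
Proposes a single probabilistic conditional score-based diffusion model that handles multiple video synthesis tasks through frame masking.
Develops a block-wise autoregressive generation approach that enables long-sequence synthesis within a practical compute budget (≤ 4 GPUs).
Introduces a convolutional U-Net architecture with SPATIN conditioning that models spatio-temporal dynamics without optical flow or recurrence.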

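Proposed Method

Uses a forward diffusion process q_t and a reverse diffusion process p_t parameterized by a denoising network ε_θ.
Conditions the diffusion on past and/or future frames by randomly masking the past and/or future frame blocks during training (masks drawn as Bi(p_mask), that is, with masking probability p_mask).
Trains a single network with a unified loss L(θ), conditioned on the masked past and/or future frames, to handle future/past prediction, unconditional generation, and interpolation.
Enables long sequences through block-wise autoregressive generation, producing several frames per step.
Uses a U-Net with SPATIN (space-time adaptive normalization) to fuse the conditioning frames with the noisy current frames, plus time conditioning through a noise-level embedding.
Evaluates several variants, including input concatenation and SPATIN conditioning, to assess the performance trade-offs.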
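The masked training objective can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the standard DDPM noise-prediction loss, a linear beta schedule, and masking by zeroing out a whole block. Flat Python lists stand in for frame tensors, and `make_alpha_bar`, `masked_training_loss`, and `eps_model` are hypothetical names.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def masked_training_loss(past, current, future, eps_model, alpha_bar, p_mask, rng):
    """One Monte Carlo sample of a noise-prediction loss with block masking.

    past, current, future: flat lists of floats standing in for frame blocks.
    eps_model: callable (x_t, past, future, t) -> predicted noise (hypothetical).
    """
    # Mask each conditioning block as a whole, independently, with probability p_mask.
    # Zeroing the block is an assumption about how "masking" is realized.
    if rng.random() < p_mask:
        past = [0.0] * len(past)
    if rng.random() < p_mask:
        future = [0.0] * len(future)

    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    t = rng.randrange(len(alpha_bar))
    eps = [rng.gauss(0.0, 1.0) for _ in current]
    signal, noise = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + noise * e for x, e in zip(current, eps)]

    # Mean squared error between the true and the predicted noise.
    eps_hat = eps_model(x_t, past, future, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)
```

Because both masks are sampled independently, a single training run visits all four conditioning patterns (prediction in either direction, unconditional generation, and interpolation), which is what lets one network serve every task.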

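Experimental Results

Research Questions

RQ1: Can a single diffusion-based model conditioned on randomly masked past and/or future frames perform video prediction, unconditional generation, and interpolation?
RQ2: Does past/future masking improve generalization and quality across tasks and datasets?
RQ3: How does block-wise autoregressive generation compare with frame-by-frame generation and fully unconditional generation in long-term consistency and efficiency?
RQ4: Which architectural conditioning (SPATIN vs. concatenation) offers the best trade-off between quality and memory usage?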

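Key Results

Achieves state-of-the-art results on the SMMNIST, BAIR, and Cityscapes video prediction benchmarks.
Shows strong interpolation performance on SMMNIST, KTH, and BAIR, often outperforming specialized interpolation methods.
Using past masking as a regularizer improves performance across prediction, generation, and interpolation relative to baselines trained without masking.
A single MCVD model can be trained jointly for unconditional generation and conditional prediction, and reaches competitive performance with relatively modest compute (1–12 days of training on ≤ 4 GPUs).
Block-wise autoregressive generation enables long video sequences while maintaining quality and consistency, without relying on optical flow or recurrent modules.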
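The block-wise autoregressive rollout referred to above can be sketched as a simple loop: generate one block conditioned on the most recent frames, append it, and repeat. This is an illustrative sketch only. `sample_block` is a hypothetical stand-in for a full reverse-diffusion sampler, and the defaults `n_past=2` and `block_size=5` are arbitrary example values rather than settings taken from the paper.

```python
def generate_blockwise(context, total_frames, sample_block, n_past=2, block_size=5):
    """Extend `context` to `total_frames` frames, one block at a time.

    sample_block: callable (past_frames, block_size) -> list of new frames.
    In a real system this would run the reverse diffusion process.
    """
    frames = list(context)
    while len(frames) < total_frames:
        # Condition only on the most recent n_past frames.
        past = frames[-n_past:]
        frames.extend(sample_block(past, block_size))
    # The last block may overshoot, so trim to the requested length.
    return frames[:total_frames]


def toy_sampler(past, block_size):
    """Toy stand-in: "frames" are integers that continue counting from the last one."""
    return [past[-1] + i + 1 for i in range(block_size)]


video = generate_blockwise(context=[0, 1], total_frames=10, sample_block=toy_sampler)
print(video)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Generating several frames per call, rather than one, reduces the number of sampler invocations needed for a long video, which is the efficiency argument the review makes for the block-wise design.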

This review was generated by AI and checked by a human editor.