QUICK REVIEW

[Paper Review] VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Xin Li, Wenqing Chu | arXiv (Cornell University) | 2023-09-01

Generative Adversarial Networks and Image Synthesis | Citations: 14

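One-Line Summary

VideoGen generates high-definition, temporally consistent video through a reference-guided latent diffusion pipeline that uses a text-to-image reference image. Its video decoder needs no paired text-video training data, and the method achieves state-of-the-art results on standard T2V benchmarks.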

ABSTRACT

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily-available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See https://videogen.github.io/VideoGen/ for more samples.

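Research Motivation and Goals

Motivated by the abundance of image-text data, the work targets high-quality, temporally consistent text-to-video generation.
Use a high-quality T2I-generated reference image to guide diffusion-based video synthesis and raise the fidelity of the video content.
Enable training of the video decoder on unlabeled video to improve motion realism and temporal consistency.
Develop a cascaded latent diffusion framework with flow-based temporal upsampling for high-definition output.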

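Proposed Method

Generate a reference image from the input text prompt with a frozen text-to-image model (Stable Diffusion).
Generate a sequence of latent video representations, from low to medium resolution, with a reference-guided cascaded latent video diffusion model conditioned on both the reference image and the text prompt.
Apply a flow-based temporal super-resolution module in latent space to upsample the temporal resolution (2× per step, up to 8×).
Map the latent video representations to a high-definition video with an enhanced video decoder, which is initialized from a pretrained image decoder and extended with temporal convolution and attention.
During training, the first frame of each ground-truth video serves as the reference image. The cascaded latent diffusion network is trained on text-video pairs from WebVid-10M, while the video decoder and the temporal super-resolution module are trained on unlabeled high-quality video.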
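
Two details of the method can be made concrete with a small sketch: how a training example might be built with the first frame as the reference image, and how repeated 2× temporal upsampling reaches roughly 8× after three steps. This is only an illustration under stated assumptions. The function names are hypothetical, each latent frame is simplified to a flat list of floats, and a plain linear blend stands in for the flow-based interpolation, which is not reproduced here.

```python
def make_training_example(video_frames, caption):
    """Build one training example (hypothetical layout).

    The first frame of the ground-truth video plays the role of the
    reference image, so no text-to-image model is needed at training time.
    """
    return {
        "reference": video_frames[0],
        "target": video_frames,
        "text": caption,
    }


def upsample_2x(frames):
    """Insert one in-between frame for every adjacent pair: T -> 2T - 1.

    ASSUMPTION: the in-between frame is a simple average of its neighbours.
    This is a stand-in for flow-guided interpolation, not the paper's module.
    """
    if not frames:
        return []
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        out.append(prev)
        out.append([0.5 * (a + b) for a, b in zip(prev, nxt)])
    out.append(frames[-1])
    return out


def temporal_upsample(frames, steps=3):
    """Apply the 2x step repeatedly; three steps give 8x denser frame spacing."""
    if not 0 <= steps <= 3:
        raise ValueError("steps must be between 0 and 3 (at most 8x)")
    for _ in range(steps):
        frames = upsample_2x(frames)
    return frames


if __name__ == "__main__":
    latents = [[float(i)] for i in range(16)]
    dense = temporal_upsample(latents, steps=3)
    print(len(latents), "->", len(dense))
```

With this convention the original frames are kept and only in-between frames are added, so 16 input frames become 121 rather than 128. Whether the actual module keeps or resamples the endpoints is not stated in this review.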

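Experimental Results

Research Questions

RQ1: Can a reference image generated by a text-to-image model improve fidelity and motion learning in text-to-video diffusion?
RQ2: Does combining reference-guided latent diffusion, flow-based temporal upsampling, and a separate video decoder yield higher visual fidelity and temporal consistency than existing T2V methods?
RQ3: How does training the video decoder on unlabeled video affect motion realism and overall video quality?
RQ4: How does incorporating a high-quality reference image into the diffusion conditioning affect standard T2V metrics?

Key Results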

Table 1: T2V results on UCF-101

Zero-shot setting
Method			Resolution	IS	FVD
CogVideo (Chinese)	Yes	Yes	480 × 480	23.55	751.34
CogVideo (English)	Yes	Yes	480 × 480	25.27	701.59
Make-A-Video	Yes	Yes	256 × 256	33.00	367.23
Ours	Yes	Yes	256 × 256	71.61 ± 0.24	554 ± 23

Fine-tuning setting
Method			Resolution	IS	FVD
TGANv2	No	No	128 × 128	26.60 ± 0.47	-
DIGAN	No	No	-	32.70 ± 0.35	577 ± 22
MoCoGAN-HD	No	No	256 × 256	33.95 ± 0.25	700 ± 24
CogVideo	Yes	Yes	160 × 160	50.46	626
VDM	No	No	64 × 64	57.80 ± 1.3	-
LVDM	No	No	256 × 256	-	372 ± 11
TATS-base	Yes	Yes	128 × 128	79.28 ± 0.38	278 ± 11
Make-A-Video	Yes	Yes	256 × 256	82.55	81.25
Ours	Yes	Yes	256 × 256	82.78 ± 0.34	345 ± 15

Table 2: T2V results on MSR-VTT

Method			Resolution	CLIPSIM	
GODIVA	No	Yes	128 × 128	0.2402	-
Nüwa	No	336 × 336	0.2439	-
CogVideo (Chinese)	Yes	Yes	480 × 480	0.2614	-
CogVideo (English)	Yes	Yes	480 × 480	0.2631	-
Make-A-Video	Yes	Yes	256 × 256	0.3049	-
Ours	Yes	Yes	256 × 256	0.3127	-

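VideoGen achieves state-of-the-art results on UCF-101 and MSR-VTT in both qualitative and quantitative evaluation.
On zero-shot UCF-101, VideoGen reaches an IS of 71.61 ± 0.24, well above the zero-shot baselines listed in the table (23.55 to 33.00).
On MSR-VTT, VideoGen achieves the highest average CLIPSIM score (0.3127) in the zero-shot setting.
Removing the reference image lowers CLIPSIM (0.2534) and IS (26.64 ± 0.47); including the T2I reference image improves both metrics.
Flow-based temporal upsampling improves frame continuity and stability compared with interpolation that is not flow-guided.
The video decoder trained on unlabeled video shows sharper textures and better temporal smoothness than the baseline.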
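
For readers unfamiliar with the MSR-VTT metric, CLIPSIM is commonly computed as the average CLIP similarity between the text prompt and each generated frame. The sketch below shows only that averaging step. It assumes the text and frame embeddings have already been produced by a CLIP text and image encoder elsewhere, and the function names are hypothetical rather than taken from the paper or any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clipsim(text_embedding, frame_embeddings):
    """Average text-to-frame cosine similarity over one video.

    ASSUMPTION: embeddings come from a CLIP model run beforehand; this
    function only averages the per-frame similarities.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    sims = [cosine(text_embedding, frame) for frame in frame_embeddings]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame matches the text, one is orthogonal to it.
    text = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(clipsim(text, frames))
```

A benchmark score such as the 0.3127 reported above would then be the mean of this per-video value over all evaluated prompts.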

This review was generated by AI and checked by a human editor.