QUICK REVIEW

[Paper Review] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Andreas Blattmann, Robin Rombach|arXiv (Cornell University)|2023. 04. 18.

Generative Adversarial Networks and Image Synthesis | Citations: 17

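One-line summary

This work introduces Video Latent Diffusion Models (Video LDMs), which turn a pre-trained image LDM into a high-resolution, temporally consistent video generator by inserting and training temporal alignment layers, enabling long, high-quality driving videos and text-to-video generation.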

ABSTRACT

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

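Research motivation and goals

Efficiently generate long, high-resolution videos by leveraging pre-trained image diffusion models.
Introduce temporal alignment layers that convert an image generator into a video generator without retraining the entire model.
Achieve temporally consistent, high-quality video synthesis for driving scenes and text-to-video tasks.
Enable transfer of temporal layers across different image LDM backbones, and with it personalized text-to-video generation.

Method

Insert temporal layers, such as temporal attention or 3D convolution blocks, into a frozen image LDM backbone to form the Video LDM.
Train only the temporal layers while keeping the spatial layers frozen, using a denoising objective similar to that of image diffusion.
Operate in a compressed latent space via the latent diffusion framework, improving efficiency and scalability.
Fine-tune the decoder component to achieve temporal consistency in pixel space (temporal autoencoder finetuning).
Enable long-horizon generation with a prediction model and context masking, synthesizing sequences conditioned on starting frames.
Train temporally aligned upsamplers that apply video-aware super resolution in latent or pixel space.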
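
The first two steps can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `spatial_layer`, `temporal_layer`, `video_block`, and the parameter names are hypothetical, latents are plain Python lists rather than tensors, and the temporal layer is a simple time-average stand-in for temporal attention or 3D convolution. The blend factor `alpha` is an assumption of this sketch, modeled on the learned mixing between spatial and temporal outputs described in the paper.

```python
# Toy sketch: a video latent is a list of T frames, each a flat list of floats.
# All names here are hypothetical; real models use tensors and learned weights.

def spatial_layer(frame):
    """Stand-in for a frozen, pre-trained image layer; sees one frame at a time."""
    return [2.0 * x + 1.0 for x in frame]


def temporal_layer(frames, weight):
    """Stand-in temporal layer: pulls each position toward its mean over time."""
    num_frames = len(frames)
    means = [sum(f[i] for f in frames) / num_frames for i in range(len(frames[0]))]
    return [[x + weight * (m - x) for x, m in zip(f, means)] for f in frames]


def video_block(frames, alpha, weight):
    """Blend the per-frame output z with the temporally mixed output z_t."""
    z = [spatial_layer(f) for f in frames]    # frames processed independently
    z_t = temporal_layer(z, weight)           # information shared across frames
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(fa, fb)]
            for fa, fb in zip(z, z_t)]


# Only the temporal parameters are marked trainable; the backbone stays frozen.
params = {
    "spatial.weights": {"trainable": False},
    "temporal.weight": {"trainable": True},
    "temporal.alpha": {"trainable": True},
}
trainable = sorted(name for name, p in params.items() if p["trainable"])

video = [[0.0, 1.0], [2.0, 3.0]]
image_only = video_block(video, alpha=1.0, weight=1.0)   # temporal path switched off
fully_mixed = video_block(video, alpha=0.0, weight=1.0)  # temporal path only
```

With `alpha=1.0` the block reproduces the per-frame output of the image layer, so in this toy setup the temporal path can be trained starting from a state that leaves the frozen backbone's behavior unchanged.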

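Experimental results

Research questions

RQ1: Can a pre-trained image diffusion model be repurposed for high-resolution, temporally consistent video generation by adding temporal layers?
RQ2: How does temporal alignment in latent space affect the quality and consistency of long videos?
RQ3: To what extent can temporal layers trained on one image LDM backbone be transferred to other backbones or used for personalized text-to-video generation?
RQ4: How does video fine-tuning of the decoder and upsamplers affect temporal consistency and quality?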

Key results

Method	FVD	FID
LVG [6]	478	53.5
Ours	389	31.6
Ours (cond.)	356	51.9
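
To put the table in relative terms, the short calculation below computes the percentage reduction of each metric against the LVG baseline. It uses only the numbers in the table; the helper name `relative_reduction` is made up for this illustration, and lower values are better for both FVD and FID.

```python
# (FVD, FID) pairs copied from the table above; lower is better for both.
results = {
    "LVG [6]": (478.0, 53.5),
    "Ours": (389.0, 31.6),
    "Ours (cond.)": (356.0, 51.9),
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


base_fvd, base_fid = results["LVG [6]"]
summary = {}
for name, (fvd, fid) in results.items():
    if name == "LVG [6]":
        continue  # the baseline is not compared against itself
    summary[name] = (round(relative_reduction(base_fvd, fvd), 1),
                     round(relative_reduction(base_fid, fid), 1))
    print(f"{name}: FVD -{summary[name][0]}%, FID -{summary[name][1]}%")
```

By these numbers, the unconditional model lowers FVD by about 18.6% and FID by about 40.9%, while the conditional model lowers FVD by about 25.5% with FID roughly on par with the baseline (about 3.0% lower).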

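Achieves state-of-the-art video quality on real driving videos at 512×1024 resolution.
The Video LDM with temporal fine-tuning beats the Long Video GAN (LVG) baseline on FVD (389 vs. 478; lower is better) while keeping FID competitive (31.6 vs. 53.5).
Temporal alignment of the video upsampler is critical for maintaining temporal consistency and preventing FVD degradation.
Stable Diffusion is converted into a text-to-video LDM via temporal layers, delivering outputs up to 1280×2048, and personalized text-to-video becomes possible through DreamBooth transfer.
Temporal layers trained on one image LDM backbone generalize to other checkpoints, enabling personalized text-to-video generation.
The Video LDM supports long videos of several minutes or more and enables multimodal driving scenario simulation.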
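
The long-video result builds on the prediction model and context masking listed under the method. A rough sketch of the chunked rollout idea follows. Everything in it is hypothetical: `build_context`, `toy_predictor`, and `rollout` are illustrative names, frames are flat lists of floats rather than latents, and the predictor is a trivial stand-in where the real system would run a mask-conditioned diffusion model.

```python
# Toy sketch of chunked long-video generation with context masking.
# All names are hypothetical; frames are flat lists of floats, not latents.

def build_context(frames, num_context):
    """Mask is 1 for frames given as context and 0 for frames to generate."""
    mask = [1 if i < num_context else 0 for i in range(len(frames))]
    masked = [list(f) if m else [0.0] * len(f) for f, m in zip(frames, mask)]
    return mask, masked


def toy_predictor(mask, masked_frames):
    """Stand-in for the prediction model: keeps context frames and fills the
    rest by adding a drift of 1.0 per step to the last context frame."""
    last = max(i for i, m in enumerate(mask) if m)
    base = masked_frames[last]
    return [f if m else [x + (i - last) for x in base]
            for i, (f, m) in enumerate(zip(masked_frames, mask))]


def rollout(first_frames, chunk_len, total_len, predictor):
    """Generate chunk by chunk, reusing the newest frames as the next context."""
    num_context = len(first_frames)
    assert chunk_len > num_context, "each chunk must add at least one new frame"
    size = len(first_frames[0])
    video = [list(f) for f in first_frames]
    while len(video) < total_len:
        context = video[-num_context:]
        placeholders = [[0.0] * size for _ in range(chunk_len - num_context)]
        mask, masked = build_context(context + placeholders, num_context)
        chunk = predictor(mask, masked)
        video.extend(chunk[num_context:])   # keep only newly generated frames
    return video[:total_len]


long_video = rollout([[0.0]], chunk_len=4, total_len=10, predictor=toy_predictor)
```

The design point the sketch tries to capture is that each chunk has a fixed length, so the total video length is limited only by how many times the loop runs, with the mask telling the model which frames are given and which must be synthesized.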


This review was generated by AI and checked by a human editor.