QUICK REVIEW

[Paper Review] Lumiere: A Space-Time Diffusion Model for Video Generation

Omer Bar-Tal, Hila Chefer | arXiv (Cornell University) | Jan 23, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 17

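One-Sentence Summary

Lumiere is a space-time diffusion model that uses a Space-Time U-Net to generate the full duration of a video in a single pass, achieving global temporal consistency and supporting a variety of video editing tasks.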

ABSTRACT

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Research Motivation and Goals

Motivate the need for globally coherent motion in text-to-video generation.
Propose a Space-Time U-Net (STUNet) that downsamples in space and time to generate full-duration videos in one pass.
Leverage a pre-trained text-to-image diffusion model with spatial super-resolution to produce high-resolution videos.
Extend MultiDiffusion along the temporal axis so that spatial super-resolution (SSR) applied to overlapping segments stays temporally continuous.
Demonstrate applications including image-to-video, video inpainting, and stylized generation.

Proposed Method

Introduce the Space-Time U-Net (STUNet), which downsamples in both space and time and performs most of its computation on a compact space-time representation.
Interleave temporal blocks with the pre-trained text-to-image (T2I) layers, and insert temporal down- and up-sampling modules after each pre-trained spatial resizing module, to enable full-duration generation.
Use factorized space-time convolutions throughout, and temporal attention only at the coarsest level, to capture motion while keeping compute under control.
Initialize the temporal down- and up-sampling modules to perform nearest-neighbor sampling, so that training starts from behavior close to that of the pre-trained T2I model.
Extend MultiDiffusion to aggregate spatial super-resolution (SSR) predictions from overlapping temporal windows, giving global coherence over the full video.
Train newly added temporal layers while keeping pre-trained T2I weights fixed.
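
The nearest-neighbor temporal sampling mentioned in the list above can be illustrated with a minimal sketch. It assumes a sampling factor of 2 and two temporal levels, both chosen only for illustration, and it uses one scalar per frame in place of a real feature map. The function names are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def temporal_downsample_nn(frames: Sequence[float], factor: int = 2) -> List[float]:
    """Keep every `factor`-th frame (nearest-neighbor temporal downsampling)."""
    return list(frames[::factor])


def temporal_upsample_nn(frames: Sequence[float], factor: int = 2) -> List[float]:
    """Repeat each frame `factor` times (nearest-neighbor temporal upsampling)."""
    return [f for f in frames for _ in range(factor)]


# Toy "video": one scalar per frame stands in for a full feature map.
video = [float(t) for t in range(80)]  # 80 frames, matching the 5-second clips

coarse = temporal_downsample_nn(video)    # assumed factor 2: 40 frames
coarser = temporal_downsample_nn(coarse)  # assumed second level: 20 frames
restored = temporal_upsample_nn(temporal_upsample_nn(coarser))  # back to 80 frames

print(len(video), len(coarse), len(coarser), len(restored))  # 80 40 20 80
```

Because the coarsest level holds a quarter of the frames in this toy setup, any operation placed there (such as temporal attention) sees a much shorter sequence, which is the compute argument the list makes.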

Experimental Results

Research Questions

RQ1: Can a single base diffusion model generate the full duration of a video with global temporal coherence, without relying on cascaded temporal super-resolution (TSR) models?
RQ2: How can spatial super-resolution be applied across overlapping temporal windows while maintaining consistency in high-resolution video generation?
RQ3: Which downstream tasks (image-to-video, inpainting, stylization) can a full-duration text-to-video (T2V) model support effectively?
RQ4: Does modeling the temporal dynamics with a Space-Time U-Net improve motion coherence compared with traditional cascaded approaches?
RQ5: How does conditioning (image, mask) affect the quality and controllability of video generation?

Key Findings

Method	FVD ↓	IS ↑
MagicVideo	655.00	-
Emu Video	606.20	42.70
Video LDM	550.61	33.45
Show-1	394.46	35.42
Make-A-Video	367.23	33.00
PYoCo	355.19	47.76
SVD	242.02	-
Lumiere (Ours)	332.49	37.54
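
To read the table at a glance, the short sketch below ranks the listed methods by each metric. The numbers are copied from the table above; `None` stands for a "-" entry, and the variable names are illustrative only.

```python
# (method, FVD, IS) copied from the table above; None marks a "-" entry.
results = [
    ("MagicVideo", 655.00, None),
    ("Emu Video", 606.20, 42.70),
    ("Video LDM", 550.61, 33.45),
    ("Show-1", 394.46, 35.42),
    ("Make-A-Video", 367.23, 33.00),
    ("PYoCo", 355.19, 47.76),
    ("SVD", 242.02, None),
    ("Lumiere", 332.49, 37.54),
]

# Lower FVD is better; higher IS is better. Methods without an IS are skipped.
by_fvd = sorted(results, key=lambda r: r[1])
by_is = sorted((r for r in results if r[2] is not None), key=lambda r: r[2], reverse=True)

fvd_rank = [r[0] for r in by_fvd].index("Lumiere") + 1
is_rank = [r[0] for r in by_is].index("Lumiere") + 1

print(f"FVD rank: {fvd_rank} of {len(by_fvd)}")  # FVD rank: 2 of 8
print(f"IS rank: {is_rank} of {len(by_is)}")     # IS rank: 3 of 6
```

Among the listed methods, only SVD has a lower FVD than Lumiere, and only PYoCo and Emu Video have a higher IS.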

Achieves state-of-the-art or competitive text-to-video quality on 5-second videos (80 frames at 16 fps).
Produces globally coherent motion by generating the full temporal duration in one pass with STUNet rather than with cascaded temporal super-resolution (TSR) models.
Demonstrates versatile downstream capabilities including image-to-video, video inpainting, stylized generation, and cinematographic editing.
In zero-shot evaluation on UCF101, Lumiere reaches an FVD of 332.49 and an IS of 37.54, competitive with the baselines in the table, and user studies favor Lumiere over the baselines.
MultiDiffusion-based spatial super-resolution (SSR) over overlapping temporal windows yields temporally coherent high-resolution videos without boundary artifacts.
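
The overlapping-window aggregation in the last finding can be sketched as follows. This is a simplified illustration of the general idea, not the paper's exact procedure: it assumes uniform averaging wherever windows overlap, a hypothetical window length of 16 frames with a stride of 8, and a toy predictor that returns one scalar per frame instead of a super-resolved image.

```python
from typing import Callable, List, Tuple


def make_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return overlapping [start, end) frame ranges that cover the whole clip."""
    if window > num_frames:
        raise ValueError("window must not exceed the number of frames")
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # final window flush with the clip end
    return [(s, s + window) for s in starts]


def blend_windows(
    num_frames: int,
    windows: List[Tuple[int, int]],
    predict: Callable[[int, int], List[float]],
) -> List[float]:
    """Average per-window predictions on every frame that several windows share."""
    total = [0.0] * num_frames
    count = [0] * num_frames
    for start, end in windows:
        for offset, value in enumerate(predict(start, end)):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# Toy predictor: each window disagrees with its neighbors by returning its own
# start index for every frame it covers (a stand-in for per-segment SSR output).
def toy_predict(start: int, end: int) -> List[float]:
    return [float(start)] * (end - start)


windows = make_windows(num_frames=80, window=16, stride=8)
blended = blend_windows(80, windows, toy_predict)

print(len(windows))                           # 9
print(blended[0], blended[8], blended[16])    # 0.0 4.0 12.0
```

Without blending, the toy output would jump by 16 at each non-overlapping segment boundary. With the overlap, each frame takes the mean of the windows that cover it, so neighboring segments are reconciled instead of meeting at a hard seam.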

This review was generated by AI and reviewed by human editors.