QUICK REVIEW

[Paper Review] Lumiere: A Space-Time Diffusion Model for Video Generation

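Omer Bar-Tal, Hila Chefer | arXiv (Cornell University) | Jan 23, 2024
Generative Adversarial Networks and Image Synthesis | Cited by 17
One-Sentence Summary

Lumiere proposes a space-time diffusion model that uses a Space-Time U-Net to generate the full duration of a video in a single pass, achieving global temporal consistency and supporting a variety of video editing tasks.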

ABSTRACT

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Motivation and Objectives

  • Motivate the need for globally coherent motion in text-to-video generation.
  • Propose a Space-Time U-Net (STUNet) that downsamples in space and time to generate full-duration videos in one pass.
  • Leverage a pre-trained text-to-image diffusion model with spatial super-resolution to produce high-resolution videos.
  • Extend MultiDiffusion to the temporal domain so that spatial super-resolution (SSR) results computed on overlapping temporal segments remain continuous.
  • Demonstrate applications including image-to-video, video inpainting, and stylized generation.
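To make the "downsamples in space and time" objective concrete, here is a minimal sketch of how the tensor shape shrinks across levels of a space-time pyramid. The 80-frame length comes from this review; the 128x128 resolution, the factor of 2 per level, the two levels, and the helper name `space_time_pyramid` are illustrative assumptions, not values confirmed by this page.

```python
from typing import List, Tuple


def space_time_pyramid(frames: int, height: int, width: int,
                       levels: int, factor: int = 2) -> List[Tuple[int, int, int]]:
    """Return (T, H, W) per level when each level divides all three axes by `factor`."""
    shapes = [(frames, height, width)]
    for _ in range(levels):
        t, h, w = shapes[-1]
        if t % factor or h % factor or w % factor:
            raise ValueError(f"shape {(t, h, w)} is not divisible by {factor}")
        shapes.append((t // factor, h // factor, w // factor))
    return shapes


# Assumed toy configuration: 80 frames (from the review), 128x128 and 2 levels (hypothetical).
shapes = space_time_pyramid(frames=80, height=128, width=128, levels=2)
for level, (t, h, w) in enumerate(shapes):
    print(f"level {level}: T={t} H={h} W={w} elements={t * h * w}")
```

Under these assumptions each level has 8x fewer space-time positions than the one before it, which is why running most of the computation at the coarsest level keeps a full-duration model affordable.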

Proposed Method

  • Introduce the Space-Time U-Net (STUNet), which downsamples in both space and time and performs most of its computation on a compact space-time representation.
  • Insert temporal down- and up-sampling modules after each pre-trained spatial resizing module of the text-to-image (T2I) network to enable full-duration generation.
  • Use factorized space-time convolutions throughout, and temporal attention only at the coarsest level, to capture motion while keeping compute under control.
  • Initialize the temporal down- and up-sampling modules to perform nearest-neighbor resampling, which gives training a sensible starting point.
  • Extend MultiDiffusion to aggregate spatial super-resolution (SSR) predictions from overlapping temporal windows, giving global coherence over the full video.
  • Train only the newly added temporal layers while keeping the pre-trained T2I weights fixed.
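Two of the bullets above can be illustrated with a small, framework-free sketch. The first pair of functions shows what nearest-neighbor resampling along the time axis does to a frame sequence, which is the behavior the temporal modules are said to start from. The second pair compares weight counts for a dense 3D convolution and the common factorized form (a 2D spatial convolution followed by a 1D temporal convolution). All function names, the channel width of 64, and the kernel size of 3 are assumptions chosen for illustration, and the learned layers in the actual model are not reproduced here.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def temporal_downsample_nn(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Nearest-neighbor downsampling in time: keep every `factor`-th frame."""
    return list(frames[::factor])


def temporal_upsample_nn(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Nearest-neighbor upsampling in time: repeat each frame `factor` times."""
    return [frame for frame in frames for _ in range(factor)]


def conv_weights_full_3d(c_in: int, c_out: int, k_space: int, k_time: int) -> int:
    """Weight count of one dense 3D convolution (biases ignored)."""
    return c_in * c_out * k_time * k_space * k_space


def conv_weights_factorized(c_in: int, c_out: int, k_space: int, k_time: int) -> int:
    """Weight count of a 2D spatial conv followed by a 1D temporal conv (biases ignored)."""
    spatial = c_in * c_out * k_space * k_space
    temporal = c_out * c_out * k_time
    return spatial + temporal


# Toy "video": strings stand in for frame tensors.
video = [f"frame{i}" for i in range(8)]
coarse = temporal_downsample_nn(video)
restored = temporal_upsample_nn(coarse)
print(coarse)
print(restored)

# Hypothetical layer size: 64 input/output channels, 3x3 spatial and 3-tap temporal kernels.
full = conv_weights_full_3d(64, 64, k_space=3, k_time=3)
factorized = conv_weights_factorized(64, 64, k_space=3, k_time=3)
print(full, factorized)
```

The round trip keeps the sequence length unchanged, and in this toy setting the factorized layer needs fewer than half the weights of the dense 3D layer.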

Experimental Results

Research Questions

  • RQ1: Can a single base diffusion model generate the full duration of a video with global temporal coherence, without relying on cascaded temporal super-resolution (TSR) models?
  • RQ2: How can spatial super-resolution be applied across overlapping temporal windows to maintain consistency in high-resolution video generation?
  • RQ3: What downstream tasks (image-to-video, inpainting, stylization) can be effectively supported by a full-duration T2V model?
  • RQ4: Does grounding the temporal dynamics in a Space-Time U-Net improve motion coherence compared to traditional cascaded approaches?
  • RQ5: What is the impact of conditioning (image, mask) on the quality and controllability of video generation?

Key Findings

Method | FVD ↓ | IS ↑
MagicVideo | 655.00 | -
Emu Video | 606.20 | 42.70
Video LDM | 550.61 | 33.45
Show-1 | 394.46 | 35.42
Make-A-Video | 367.23 | 33.00
PYoCo | 355.19 | 47.76
SVD | 242.02 | -
Lumiere (Ours) | 332.49 | 37.54
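To read the table at a glance, the short sketch below ranks the methods by each metric (lower FVD is better, higher IS is better). The numbers are copied from the table above; the variable names are arbitrary, and methods without a reported IS are simply left out of the IS ranking.

```python
# (FVD, IS) per method, copied from the table; None marks a score that is not reported.
results = {
    "MagicVideo": (655.00, None),
    "Emu Video": (606.20, 42.70),
    "Video LDM": (550.61, 33.45),
    "Show-1": (394.46, 35.42),
    "Make-A-Video": (367.23, 33.00),
    "PYoCo": (355.19, 47.76),
    "SVD": (242.02, None),
    "Lumiere (Ours)": (332.49, 37.54),
}

# Lower FVD is better, so sort ascending.
by_fvd = sorted(results, key=lambda name: results[name][0])

# Higher IS is better, so sort descending and skip methods with no IS.
by_is = sorted(
    (name for name in results if results[name][1] is not None),
    key=lambda name: results[name][1],
    reverse=True,
)

print("FVD ranking:", by_fvd)
print("IS ranking: ", by_is)
print("Lumiere FVD rank:", by_fvd.index("Lumiere (Ours)") + 1)
print("Lumiere IS rank: ", by_is.index("Lumiere (Ours)") + 1)
```

On these figures Lumiere is second by FVD (behind SVD) and third by IS among the methods that report it, which matches the "competitive" wording in the findings that follow.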
  • Achieves state-of-the-art or competitive text-to-video generation quality with 5-second, 80-frame videos at 16 fps.
  • Produces globally coherent motion by generating the full temporal duration in one pass with STUNet instead of cascaded temporal super-resolution (TSR) models.
  • Demonstrates versatile downstream capabilities including image-to-video, video inpainting, stylized generation, and cinematographic editing.
  • Zero-shot UCF101 evaluation shows competitive FVD and IS scores against baselines, and user studies favor Lumiere over the baselines.
  • MultiDiffusion-based spatial super-resolution (SSR) over overlapping temporal windows yields temporally coherent high-resolution videos without boundary artifacts.
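The last finding relies on blending predictions from overlapping temporal windows. Below is a minimal sketch of that idea, assuming a plain uniform average over every window that covers a frame. The window length of 16, the stride of 12, the helper names, and the toy predictor are all hypothetical, and a single float stands in for a full high-resolution frame.

```python
from typing import Callable, List, Tuple


def temporal_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into overlapping [start, end) windows covering every frame."""
    if window > num_frames or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window <= num_frames")
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # final window sits flush with the end
    return [(s, s + window) for s in starts]


def blend_windows(num_frames: int, windows: List[Tuple[int, int]],
                  predict: Callable[[int, int], List[float]]) -> List[float]:
    """Average per-window predictions on every frame that several windows share."""
    total = [0.0] * num_frames
    count = [0] * num_frames
    for start, end in windows:
        prediction = predict(start, end)  # one value per frame in the window
        for offset, value in enumerate(prediction):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


def toy_predict(start: int, end: int) -> List[float]:
    """Stand-in for a per-window super-resolution step: a constant that differs per window."""
    return [float(start)] * (end - start)


windows = temporal_windows(num_frames=80, window=16, stride=12)
blended = blend_windows(80, windows, toy_predict)
print(windows)
print(blended[10:18])
```

In this toy run, frames covered by one window keep that window's value, while frames in an overlap receive the mean of the windows that cover them. Applying such a blend at each denoising step, rather than once at the end, is the general MultiDiffusion idea; the exact weighting used by the authors is not specified on this page.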

This review was generated by AI and checked by human editors.