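[Paper Explained] Lumiere: A Space-Time Diffusion Model for Video Generation
Lumiere proposes a space-time diffusion model that uses a Space-Time U-Net to generate the full duration of a video in a single pass, achieving global temporal consistency and supporting a variety of video editing tasks.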
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
Research Motivation and Objectives
- Motivate the need for globally coherent motion in text-to-video generation.
- Propose a Space-Time U-Net (STUNet) that downsamples in space and time to generate full-duration videos in one pass.
- Leverage a pre-trained text-to-image diffusion model with spatial super-resolution to produce high-resolution videos.
- Apply Multidiffusion to keep spatial super-resolution (SSR) output temporally continuous across overlapping segments.
- Demonstrate applications including image-to-video, video inpainting, and stylized generation.
Proposed Method
- Introduce Space-Time U-Net (STUNet) that downsamples in both space and time and processes most computation on a compact space-time representation.
- Interleave temporal blocks with the pre-trained T2I layers and insert temporal down- and up-sampling modules after each pre-trained spatial resizing module to enable full-duration generation.
- Use factorized space-time convolutions and temporal attention at the coarsest level to capture motion while controlling compute.
- Initialize the temporal down- and up-sampling modules to perform nearest-neighbor down/up-sampling, which gives training a well-behaved starting point.
- Extend Multidiffusion to aggregate SSR predictions from overlapping temporal windows for global coherence over the full video.
- Train newly added temporal layers while keeping pre-trained T2I weights fixed.
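The nearest-neighbor initialization in the list above can be illustrated with a toy example. This is a minimal sketch in plain Python, not the paper's implementation: each frame is represented as a short list of floats, and the function names `temporal_downsample_nn` and `temporal_upsample_nn`, as well as the default factor of 2, are assumptions chosen for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # toy stand-in for the features of one frame


def temporal_downsample_nn(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Nearest-neighbor temporal downsampling: keep every `factor`-th frame."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames[::factor]]


def temporal_upsample_nn(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Nearest-neighbor temporal upsampling: repeat each frame `factor` times."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames for _ in range(factor)]


# Hypothetical 8-frame clip with a single value per frame.
clip = [[float(t)] for t in range(8)]
coarse = temporal_downsample_nn(clip)    # 4 frames: t = 0, 2, 4, 6
restored = temporal_upsample_nn(coarse)  # 8 frames again, each kept frame repeated
```

For a clip whose frames are all identical, the down/up round trip returns the input unchanged, which is one way to see why this initialization is a gentle starting point for the added temporal modules. In the real model these operations would be learnable layers that merely start out behaving this way.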
Experimental Results
Research Questions
- RQ1: Can a single base diffusion model generate a full video duration with global temporal coherence without relying on cascaded TSR models?
- RQ2: How can spatial super-resolution be applied across overlapping temporal windows to maintain consistency in high-resolution video generation?
- RQ3: What downstream tasks (image-to-video, inpainting, stylization) can be effectively supported by a full-duration T2V model?
- RQ4: Does grounding the temporal dynamics in a Space-Time U-Net improve motion coherence compared to traditional cascaded approaches?
- RQ5: What is the impact of conditioning (image, mask) on the quality and controllability of video generation?
Key Findings
| Method | FVD ↓ | IS ↑ |
|---|---|---|
| MagicVideo | 655.00 | - |
| Emu Video | 606.20 | 42.70 |
| Video LDM | 550.61 | 33.45 |
| Show-1 | 394.46 | 35.42 |
| Make-A-Video | 367.23 | 33.00 |
| PYoCo | 355.19 | 47.76 |
| SVD | 242.02 | - |
| Lumiere (Ours) | 332.49 | 37.54 |
- Achieves state-of-the-art or competitive text-to-video generation quality with 5-second, 80-frame videos at 16fps.
- Generates globally coherent motion by generating the full temporal duration in one pass using STUNet instead of cascaded TSR models.
- Demonstrates versatile downstream capabilities including image-to-video, video inpainting, stylized generation, and cinematographic editing.
- Zero-shot UCF101 evaluation shows competitive FVD and IS scores against baselines, with user studies favoring Lumiere over baselines.
- Multidiffusion-based SSR over overlapping windows yields temporally coherent high-resolution videos without boundary artifacts.
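The idea of blending SSR predictions over overlapping temporal windows can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: frames are single floats, `predict` is a hypothetical stand-in for one SSR prediction on a segment, overlapping predictions are combined by a uniform per-frame average, and the names `temporal_windows` and `blend_windows` are invented for this example.

```python
from typing import Callable, List, Sequence, Tuple


def temporal_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) windows covering every frame; they overlap when stride < window."""
    if window < 1 or not 1 <= stride <= window:
        raise ValueError("need window >= 1 and 1 <= stride <= window")
    if window >= num_frames:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # extra window flush with the last frame
    return [(s, s + window) for s in starts]


def blend_windows(
    frames: Sequence[float],
    window: int,
    stride: int,
    predict: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Run `predict` on each window and average the predictions per frame."""
    total = [0.0] * len(frames)
    count = [0] * len(frames)
    for start, end in temporal_windows(len(frames), window, stride):
        prediction = predict(frames[start:end])
        for offset, value in enumerate(prediction):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# Hypothetical predictor that disagrees between windows by a constant bias.
_biases = iter([0.0, 1.0])


def biased_predict(segment: Sequence[float]) -> List[float]:
    bias = next(_biases)
    return [value + bias for value in segment]


blended = blend_windows([0.0] * 6, window=4, stride=2, predict=biased_predict)
# blended == [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

In the toy run, the two windows disagree by a constant offset, and the averaged overlap frames take an intermediate value instead of jumping at a hard segment boundary. A real pipeline would apply this reconciliation at each denoising step on image tensors rather than once on scalars.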
This explainer was generated by AI and reviewed by a human editor.