QUICK REVIEW

[Paper Review] A Good Image Generator Is What You Need for High-Resolution Video Synthesis

Yu Tian, Jian Ren|arXiv (Cornell University)|Apr 30, 2021

Generative Adversarial Networks and Image Synthesis|References 72|Cited by 36

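One-Sentence Summary

This paper (MoCoGAN-HD) generates high-quality, high-resolution videos by pairing a fixed, pre-trained image generator with motion trajectories learned in its latent space, which enables cross-domain video synthesis and substantially improves computational efficiency.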

ABSTRACT

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

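Motivation and Objectives

Demonstrate that a fixed, pre-trained image generator can drive high-resolution video synthesis once latent motion trajectories are learned for it.
Disentangle content from motion to allow flexible video manipulation and cross-domain synthesis.
Improve the efficiency of video generation so that it scales to HD resolutions (up to 1024×1024).
Introduce cross-domain video synthesis, in which the image domain and the motion domain come from different datasets.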

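Proposed Method

Use a motion generator with two LSTM layers to predict a latent trajectory in the shared image latent space.
Represent each frame's latent code as a residual relative to the previous code, computed using a PCA basis of latent directions.
Use a contrastive image discriminator to enforce content consistency and a multi-scale video discriminator to learn realistic motion patterns.
Maximize the mutual information between the motion latent variables and the LSTM hidden states to prevent motion mode collapse.
Train with adversarial losses (video and image discriminators) combined with a contrastive, content-preserving loss (InfoNCE) to keep frames consistent.
Support HD generation by integrating with pre-trained image generators such as StyleGAN2 and BigGAN.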
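
The residual latent representation described in this section can be illustrated with a small sketch. This is one plausible reading of the idea, not the authors' implementation: the function name `residual_trajectory`, the `step` scale factor, and the use of plain coefficient vectors in place of real LSTM outputs are all assumptions made for illustration, and the basis is simply passed in rather than computed by PCA.

```python
def residual_trajectory(z1, basis, motion_codes, step=1.0):
    """Build latent codes z_1..z_n, each equal to the previous code plus a residual.

    z1:           starting latent code (list of floats, length d)
    basis:        k direction vectors of length d (a stand-in for a PCA basis)
    motion_codes: per-step coefficient vectors of length k
                  (a stand-in for the motion generator's LSTM outputs)
    step:         hypothetical scale applied to every residual
    """
    trajectory = [list(z1)]
    for coeffs in motion_codes:
        prev = trajectory[-1]
        # The residual is a linear combination of the basis directions.
        residual = [
            step * sum(c * direction[i] for c, direction in zip(coeffs, basis))
            for i in range(len(prev))
        ]
        # Each new code is expressed relative to the previous one.
        trajectory.append([p + r for p, r in zip(prev, residual)])
    return trajectory


# Toy usage: a 3-D latent space with two basis directions and two motion steps.
toy_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_codes = [[1.0, 2.0], [0.5, -1.0]]
toy_path = residual_trajectory([0.0, 0.0, 0.0], toy_basis, toy_codes)
```

In a full pipeline, each code in the returned trajectory would be passed through the frozen image generator to render one frame. Because every code is a small step away from the previous one, consecutive frames stay close in content.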

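Experimental Results

Research Questions

RQ1: Can a fixed, pre-trained image generator synthesize high-quality, temporally consistent HD video by learning motion trajectories in its latent space?
RQ2: Does disentangling motion from content in the latent space enable cross-domain video synthesis, where the image domain and the motion domain come from different datasets?
RQ3: Which combination of discriminators and auxiliary losses best produces realistic temporal dynamics while preserving content fidelity?
RQ4: How does MoCoGAN-HD compare with state-of-the-art video generation methods on standard benchmarks and in cross-domain settings?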

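Key Findings

Achieves state-of-the-art results on video generation benchmarks (such as UCF-101, FaceForensics, and Sky Time-lapse) while producing high-resolution frames.
On UCF-101, the method reaches an Inception Score of 33.95 and a Fréchet Video Distance of 700.00, reported in comparison with prior methods.
On FaceForensics, the method reaches a Fréchet Video Distance of 53.26 and an Average Content Distance of 0.3300, and is preferred by human raters in 73.6% of pairwise comparisons.
On Sky Time-lapse, the model significantly outperforms MDGAN and DTVNet on FVD (e.g., 77.77) and reaches a PSNR/SSIM of 22.286/0.688 when predicting frames.
The framework performs cross-domain video synthesis (e.g., FFHQ with VoxCeleb, LSUN-Church with TLVDB, AFHQ-Dog with VoxCeleb, AnimeFaces with VoxCeleb) at resolutions up to 1024×1024, demonstrating that motion can be transferred across content domains.
Ablation studies show that the contrastive image discriminator, the video discriminator, the residual motion representation, and the mutual information loss all matter for diversity and fidelity.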
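
For readers unfamiliar with the frame-prediction metric cited above, the following is a minimal sketch of PSNR using its standard definition, 10·log10(MAX² / MSE). It is not the evaluation code behind the reported numbers: the function name `psnr`, the flat-list pixel representation, and the default peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(prediction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a predicted frame closer to the reference. SSIM, the second number in the pair, measures structural similarity and is not covered by this sketch.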


This review was generated by AI and reviewed by human editors.