QUICK REVIEW

[Paper Review] A Good Image Generator Is What You Need for High-Resolution Video Synthesis

Yu Tian, Jian Ren|arXiv (Cornell University)|Apr 30, 2021

Generative Adversarial Networks and Image Synthesis|72 references|36 citations

TL;DR

This paper (MoCoGAN-HD) shows that high-quality, high-resolution video can be generated by composing a fixed, pre-trained image generator with a learnable motion trajectory in its latent space, enabling cross-domain video synthesis and substantial efficiency gains.

ABSTRACT

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

Motivation & Objective

Demonstrate that a fixed, pre-trained image generator can drive high-resolution video synthesis by learning a latent motion trajectory.
Disentangle content and motion to enable flexible video manipulation and cross-domain synthesis.
Improve efficiency of video generation to enable HD resolutions up to 1024×1024.
Introduce cross-domain video synthesis where image and motion domains come from different datasets.

Proposed method

Use a motion generator with two LSTMs to predict a latent trajectory in the shared image latent space.
Represent each frame's latent code as the previous frame's code plus a residual, with the residual expressed in a PCA-based basis of latent directions.
Employ a contrastive image discriminator to enforce content consistency and a multi-scale video discriminator to learn realistic motion patterns.
Maximize mutual information between motion latent variables and LSTM hidden states to prevent motion mode collapse.
Train with a combination of adversarial losses (video and image discriminators) and a contrastive/content-preserving loss (InfoNCE) for frame consistency.
Support HD generation by integrating with pre-trained image generators such as StyleGAN2 and BigGAN.
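
The residual formulation in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `pca_basis` and `latent_trajectory`, the step size, the latent dimensions, and the random coefficients that stand in for the LSTM outputs are all assumptions made for the example. Random Gaussian samples also stand in for latent codes drawn from a real pre-trained generator.

```python
import numpy as np

def pca_basis(latents, num_components):
    """Return the top principal directions (as rows) of a set of latent codes."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_components]

def latent_trajectory(z0, coeffs, basis, step=0.5):
    """Build codes z_0..z_T with z_t = z_{t-1} + step * (coeffs[t-1] @ basis)."""
    codes = [z0]
    for h in coeffs:
        # Each residual is a linear combination of the PCA directions.
        codes.append(codes[-1] + step * (h @ basis))
    return np.stack(codes)

rng = np.random.default_rng(0)
latent_dim, num_components, num_frames = 16, 4, 8  # hypothetical sizes

# Stand-in for latent codes sampled from the image generator's latent space.
samples = rng.standard_normal((256, latent_dim))
basis = pca_basis(samples, num_components)

z0 = rng.standard_normal(latent_dim)  # content code for the first frame
# Stand-in for the per-frame outputs of the motion generator (LSTMs in the paper).
coeffs = rng.standard_normal((num_frames - 1, num_components))

trajectory = latent_trajectory(z0, coeffs, basis)
print(trajectory.shape)
```

In a full pipeline, each row of `trajectory` would be passed through the frozen image generator to render one frame, so only the component producing `coeffs` needs training.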

Experimental results

Research questions

RQ1: Can a fixed, pre-trained image generator be used to synthesize high-quality, temporally coherent HD videos by learning a latent-space motion trajectory?
RQ2: Does disentangling motion and content in the latent space enable cross-domain video synthesis where image and motion domains come from different datasets?
RQ3: What combination of discriminators and auxiliary losses best preserves content fidelity while producing realistic temporal dynamics?
RQ4: How does MoCoGAN-HD perform against state-of-the-art video generation methods on standard benchmarks and cross-domain scenarios?

Key findings

Achieves state-of-the-art results on video generation benchmarks (e.g., UCF-101, FaceForensics, Sky Time-lapse) with high-resolution frames.
On UCF-101, the method reaches an Inception Score of 33.95 (higher is better) and a Fréchet Video Distance of 700.00 (lower is better).
For FaceForensics, the approach attains a Fréchet Video Distance of 53.26 and an Average Content Distance of 0.3300, with 73.6% human preference in pairwise judgments over a baseline.
On Sky Time-lapse, the model substantially outperforms MDGAN and DTVNet in FVD (e.g., 77.77) and achieves PSNR/SSIM of 22.286/0.688 when predicting frames.
The framework enables cross-domain video synthesis (e.g., FFHQ with VoxCeleb, LSUN-Church with TLVDB, AFHQ-Dog with VoxCeleb, AnimeFaces with VoxCeleb) at resolutions up to 1024×1024, demonstrating motion transfer across content domains.
Ablation studies show the importance of the contrastive image discriminator, the video discriminator, motion residual formulation, and the mutual information loss for diversity and fidelity.
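
For readers unfamiliar with the PSNR figure quoted for Sky Time-lapse, the standard definition is 10·log10(MAX² / MSE), in decibels. The sketch below shows that formula only. The function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions for illustration; the review does not state the exact evaluation protocol used in the paper.

```python
import math

def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(prediction):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every predicted pixel is off by 10 intensity levels, so MSE = 100.
reference = [100.0] * 16
prediction = [110.0] * 16
score = psnr(reference, prediction)
print(round(score, 2))
```

Higher PSNR means the predicted frame is closer to the ground truth in a per-pixel sense; it says nothing about temporal realism, which is why FVD is reported alongside it.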

This review was created by AI and reviewed by human editors.