QUICK REVIEW

[Paper Review] A Good Image Generator Is What You Need for High-Resolution Video Synthesis

Yu Tian, Jian Ren|arXiv (Cornell University)|Apr 30, 2021

Generative Adversarial Networks and Image Synthesis|References: 72|Citations: 36

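ONE-LINE SUMMARY

This paper (MoCoGAN-HD) shows that combining a fixed, pre-trained image generator with a learnable motion trajectory in its latent space can generate high-quality, high-resolution video, which enables cross-domain video synthesis and a large gain in computational efficiency.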

ABSTRACT

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

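MOTIVATION AND OBJECTIVES

Show that a fixed, pre-trained image generator can drive high-resolution video synthesis when a motion trajectory is learned in its latent space.
Disentangle content and motion to enable flexible video manipulation and cross-domain synthesis.
Improve the efficiency of video generation up to HD resolution (at most 1024×1024).
Introduce cross-domain video synthesis, in which the image domain and the motion domain come from different datasets.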

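PROPOSED METHOD

A motion generator built from two LSTMs predicts a latent trajectory in the shared image latent space.
Each frame's latent code is expressed as a residual relative to the previous code, computed along a PCA-based basis of latent directions.
A contrastive image discriminator enforces content consistency, and a multi-scale video discriminator learns realistic motion patterns.
The mutual information between the motion latent variable and the LSTM hidden state is maximized to prevent motion mode collapse.
Training combines adversarial losses (video discriminator and image discriminator) with a contrastive, content-preserving loss (InfoNCE) for frame consistency.
The method integrates with pre-trained image generators such as StyleGAN2 and BigGAN to support HD generation.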
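
The residual formulation in the method list can be illustrated with a small NumPy sketch. It is not the authors' implementation: the names `pca_basis`, `rollout`, and `step`, the latent size of 64, and the 8 basis directions are all assumptions chosen for illustration, and random vectors stand in both for latent codes sampled from the image generator and for the coefficients an LSTM motion generator would predict.

```python
import numpy as np

def pca_basis(latents, k):
    """Return the top-k principal directions (as rows) of a set of latent codes."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def rollout(z0, coeffs, basis, step=1.0):
    """Build a latent trajectory with z_t = z_{t-1} + step * (coeffs[t-1] @ basis)."""
    traj = [z0]
    for c in coeffs:
        traj.append(traj[-1] + step * (c @ basis))
    return np.stack(traj)

rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 64))  # stand-in for sampled latent codes (assumption)
basis = pca_basis(latents, k=8)        # shape (8, 64)
z0 = rng.normal(size=64)               # content code of the first frame
coeffs = rng.normal(size=(15, 8))      # stand-in for motion generator outputs (assumption)
traj = rollout(z0, coeffs, basis, step=0.1)  # 16 latent codes, one per frame
```

In a full pipeline each row of `traj` would be passed through the frozen image generator to render one frame. Because every update is confined to the span of `basis`, the motion stays within the main directions of variation of the latent space while the first code `z0` fixes the content.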

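EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a fixed, pre-trained image generator synthesize high-quality, temporally consistent HD video when a motion trajectory is learned in its latent space?
RQ2: Does disentangling motion and content in the latent space enable cross-domain video synthesis, where the image domain and the motion domain come from different datasets?
RQ3: Which combination of discriminators and auxiliary losses best maintains realistic temporal dynamics while preserving content fidelity?
RQ4: How does MoCoGAN-HD compare with state-of-the-art video generation methods on standard benchmarks and in cross-domain scenarios?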

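Key Findings

Achieves state-of-the-art results with high-resolution frames on video generation benchmarks (e.g., UCF-101, FaceForensics, Sky Time-lapse).
On UCF-101, reaches an Inception Score of 33.95 and a Fréchet Video Distance of 700.00 (compared against prior methods).
On FaceForensics, reaches a Fréchet Video Distance of 53.26 and an Average Content Distance of 0.3300, and is preferred by human raters in 73.6% of pairwise comparisons against the baseline.
On Sky Time-lapse, achieves a substantially better FVD than MDGAN and DTVNet (e.g., 77.77) and a PSNR/SSIM of 22.286/0.688 on predicted frames.
The framework performs cross-domain video synthesis at resolutions up to 1024×1024 for pairs such as FFHQ with VoxCeleb, LSUN-Church with TLVDB, AFHQ-Dog with VoxCeleb, and AnimeFaces with VoxCeleb, demonstrating motion transfer across content domains.
Ablation studies show that the contrastive image discriminator, the video discriminator, the motion-residual formulation, and the mutual-information loss are all important for diversity and fidelity.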
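
As a reminder of what a contrastive (InfoNCE) term computes, here is a generic NumPy sketch for a single query embedding. It shows the standard InfoNCE loss, not the paper's exact discriminator objective. The function name `info_nce`, the temperature of 0.1, the embedding size, and the choice of positives and negatives are assumptions made for illustration.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE: cross-entropy of picking the positive among all candidates."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, p, n = normalize(query), normalize(positive), normalize(negatives)
    # Cosine similarities; index 0 holds the positive pair.
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    logits -= logits.max()  # shift for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
frame_a = rng.normal(size=32)              # embedding of one frame (stand-in)
frame_b = frame_a + 0.05 * rng.normal(size=32)  # frame from the same video (stand-in)
others = rng.normal(size=(16, 32))         # frames from other videos (stand-in)

loss_matched = info_nce(frame_a, frame_b, others)
loss_mismatched = info_nce(frame_a, -frame_a, others)
```

The loss is small when the query is closer to its positive than to every negative, which is how a term of this kind can push frames of one generated video toward a shared content representation.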

This review was generated by AI and checked by a human editor.