QUICK REVIEW

[Paper Review] MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou, Weimin Wang|arXiv (Cornell University)|Nov 20, 2022

Generative Adversarial Networks and Image Synthesis|Cited by 63

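One-Sentence Summary

MagicVideo is a latent-diffusion video generator built around a lightweight frame-wise adaptor and a directed temporal attention module. It uses a VideoVAE and unsupervised pre-training to improve quality, and it generates 256x256 text-conditioned videos efficiently on a single GPU.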

ABSTRACT

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works that directly train video models in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. Besides, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. Thus, we can exploit the informative weights of convolution operators from a text-to-image model for accelerating video training. To ameliorate the pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to https://magicvideo.github.io/# for more examples.

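Motivation and Objectives

Address the data efficiency and computational cost of text-to-video generation.
Model the video distribution in a low-dimensional latent space to reduce computation.
Leverage the pre-trained weights of an image generation model to accelerate video training.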
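
To make the computational motivation concrete, the following is a rough back-of-the-envelope sketch comparing how many values a denoiser must process per clip in RGB space versus a latent space. The clip size (16 frames at 256x256) comes from this review; the 8x spatial downsampling factor and the 4 latent channels are assumptions borrowed from common image latent diffusion setups, not figures confirmed here, and the function names are hypothetical.

```python
def tensor_elements(frames, channels, height, width):
    """Number of scalar values in a (frames, channels, height, width) tensor."""
    return frames * channels * height * width


def latent_shape(frames, height, width, downsample=8, latent_channels=4):
    """Shape of the latent clip under an ASSUMED VAE configuration.

    downsample and latent_channels are illustrative defaults, not values
    taken from the MagicVideo paper.
    """
    return frames, latent_channels, height // downsample, width // downsample


# 16 keyframes at 256x256 with 3 RGB channels, as described in the review.
rgb = tensor_elements(16, 3, 256, 256)

# The same clip after the assumed 8x spatial compression into 4 channels.
lat = tensor_elements(*latent_shape(16, 256, 256))

reduction = rgb / lat
print(f"RGB values: {rgb}, latent values: {lat}, reduction: {reduction:.0f}x")
```

Under these assumed settings the latent tensor holds 48x fewer values than the RGB clip. This counts tensor elements only; it is not the 64x FLOPs reduction relative to VDM reported in the abstract, which also depends on the network design.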

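Proposed Method

Generate 16 keyframes with latent diffusion in a low-dimensional video latent space.
Introduce a 3D U-Net with a lightweight frame-wise adaptor and a directed temporal attention module to model spatio-temporal features.
Replace 3D/2+1D convolutions with a video distribution adaptor (2D adaptor) to cut computation and reuse the pre-training of the image model.
Incorporate a VideoVAE decoder to reduce frame-level dithering.
Train an interpolation network that synthesizes intermediate frames between keyframes to produce smooth motion.
Apply a diffusion-based super-resolution model to upscale the 256x256 frames to a higher resolution.
Adopt unsupervised pre-training with CLIP-based frame embeddings, then fine-tune on text-video pairs.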
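
As an illustration of the frame-wise adaptor idea, the sketch below applies a separate learnable scale and bias to each frame's features after a shared 2D operation. This parameterization (one scalar scale and one scalar bias per frame) and the name `frame_adaptor` are assumptions made for illustration; the review does not specify the exact form used in MagicVideo.

```python
def frame_adaptor(features, scales, biases):
    """Apply a per-frame scale and bias to per-frame features.

    features: list of T frames, each a flat list of floats. These stand in
              for the output of a shared, image-pretrained 2D operation that
              is applied identically to every frame.
    scales, biases: one scalar per frame (a hypothetical parameterization).
    """
    if not (len(features) == len(scales) == len(biases)):
        raise ValueError("need exactly one scale and one bias per frame")
    return [
        [scale * value + bias for value in frame]
        for frame, scale, bias in zip(features, scales, biases)
    ]


# Two frames with two feature values each.
feats = [[1.0, 2.0], [3.0, 4.0]]

# Identity setting: the shared 2D features pass through unchanged.
unchanged = frame_adaptor(feats, scales=[1.0, 1.0], biases=[0.0, 0.0])

# Frame-specific adjustment: each frame gets its own scale and shift.
adjusted = frame_adaptor(feats, scales=[2.0, 0.5], biases=[1.0, -1.0])
```

With scales of 1 and biases of 0 the adaptor is an identity, so a layer initialized this way would leave the image-pretrained features untouched at the start of video training. Whether MagicVideo initializes its adaptor this way is not stated here.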

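Experimental Results

Research Questions

RQ1: Can latent diffusion in a low-dimensional latent space effectively generate videos that are temporally consistent and aligned with the text?
RQ2: Do the frame-wise adaptor and directed temporal attention improve quality and temporal consistency compared with conventional 3D/2+1D video models?
RQ3: How does unsupervised pre-training on video data affect final video quality after fine-tuning on text-video pairs?
RQ4: What effect does VideoVAE decoding have on reducing dithering artifacts in generated videos?
RQ5: How well does the approach scale to high-resolution upsampling via a super-resolution (SR) model?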

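Key Findings

MagicVideo generates high-quality, temporally consistent videos that align with the text prompt, and it outperforms several strong baselines in qualitative comparisons.
The directed self-attention mechanism lowers the Fréchet Video Distance (FVD) and improves motion consistency by modeling one-directional temporal dynamics.
The adaptor module, which uses per-frame 2D convolutions, cuts computation substantially while maintaining or improving video quality.
Unsupervised pre-training (using CLIP frame embeddings) substantially improves video quality, lowering FVD by about 60 in cross-dataset ablations.
In zero-shot evaluation, MagicVideo achieves FID/FVD scores on MSR-VTT and UCF-101 that are competitive with or better than the baselines (e.g., MSR-VTT: FID 36.5, FVD 998; UCF-101: FID 145, FVD 655).
The VideoVAE decoder, combined with temporal attention, mitigates frame dithering and smooths RGB reconstruction.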
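
The one-directional temporal modeling mentioned in the findings can be illustrated with a causally masked self-attention over the frame axis, where each frame attends only to itself and earlier frames. The sketch below is a minimal pure-Python version under simplifying assumptions: it handles a single spatial location, uses the frame features directly as queries, keys, and values (no learned projections), and the function name is hypothetical. It is not the authors' implementation.

```python
import math


def directed_temporal_attention(frames):
    """Causally masked self-attention across frames at one spatial location.

    frames: list of T feature vectors (lists of floats), one per frame.
    Returns (outputs, weights), where weights[t][s] is how strongly frame t
    attends to frame s. Entries with s > t are exactly 0.0 (masked out).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for t in range(num_frames):
        # Score frame t only against frames 0..t; later frames are masked.
        scores = [
            scale * sum(q * k for q, k in zip(frames[t], frames[s]))
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible (past and current) frames.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (num_frames - t - 1)
        weights.append(row)
        # Weighted sum of the frame features, which act as the values here.
        outputs.append(
            [sum(row[s] * frames[s][d] for s in range(num_frames)) for d in range(dim)]
        )
    return outputs, weights


clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = directed_temporal_attention(clip)
```

In this toy clip the first frame can only attend to itself, so its output equals its input, while the last frame mixes information from all three frames. No frame receives information from a later one, which is the directional constraint the sketch is meant to show.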

This review was generated by AI and checked by a human editor.