QUICK REVIEW

[Paper Review] VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Xin Li, Wenqing Chu|arXiv (Cornell University)|Sep 1, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 14

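One-Line Summary

VideoGen uses a latent diffusion pipeline guided by a reference image from a text-to-image model to generate high-resolution, temporally consistent videos. Its video decoder needs no text-video training data, and the method reports state-of-the-art results on standard T2V benchmarks.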

ABSTRACT

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily-available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See https://videogen.github.io/VideoGen/ for more samples.

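Motivation and Objectives

Leverage large-scale image-text data to achieve high-quality, temporally consistent text-to-video generation.
Improve the fidelity of video content by using a high-quality reference image from a T2I model to guide diffusion-based video synthesis.
Enable training of the video decoder on unlabeled video to improve motion realism and temporal consistency.
Develop a cascaded latent diffusion framework with flow-based temporal upsampling for high-resolution output.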

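Proposed Method

Generate a reference image from the input text prompt with a frozen text-to-image model (Stable Diffusion).
Generate a sequence of low-to-medium-resolution latent video representations with a reference-guided cascaded latent video diffusion model conditioned on both the reference image and the text prompt.
Apply a flow-based temporal super-resolution module in latent space to upsample the temporal resolution (2x per step, up to 8x).
Map the latent video representations to a high-definition video with an enhanced video decoder, which is initialized from a pretrained image decoder and adds temporal convolution and attention.
Train the cascaded latent diffusion network on text-video pairs (WebVid-10M), while the video decoder and the temporal super-resolution module are trained on unpaired high-quality video. During training, the reference image is the first frame of the video.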
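
The "2x per step, up to 8x" temporal upsampling can be illustrated with a minimal sketch. It assumes each step inserts one new latent frame between every adjacent pair, so n frames become 2n - 1; the review does not state the exact frame counts used in the paper. The names `upsample_once` and `temporal_upsample` are hypothetical, and the default linear blend is only a stand-in for the paper's flow-guided interpolation, which is not reproduced here.

```python
import numpy as np


def upsample_once(latents, interpolate=None):
    """Insert one new latent frame between each adjacent pair (n -> 2n - 1 frames)."""
    if interpolate is None:
        # Stand-in only: a plain linear blend of the two neighbours.
        # A flow-guided module would warp the neighbours before blending.
        def interpolate(prev, nxt):
            return 0.5 * (prev + nxt)

    out = [latents[0]]
    for prev, nxt in zip(latents[:-1], latents[1:]):
        out.append(interpolate(prev, nxt))
        out.append(nxt)
    return np.stack(out)


def temporal_upsample(latents, factor=8, interpolate=None):
    """Apply the 2x step repeatedly until the requested power-of-two factor is reached."""
    if factor < 1 or factor & (factor - 1):
        raise ValueError("factor must be a power of two (1, 2, 4, 8, ...)")
    while factor > 1:
        latents = upsample_once(latents, interpolate)
        factor //= 2
    return latents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical latent video: 16 frames, 4 channels, 32 x 32 latent resolution.
    video = rng.normal(size=(16, 4, 32, 32))
    for factor in (1, 2, 4, 8):
        print(factor, temporal_upsample(video, factor).shape[0])  # 16, 31, 61, 121 frames
```

Passing a different `interpolate` callable is where a learned, flow-based module would plug in; the cascade logic around it stays the same.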

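Experimental Results

Research Questions

RQ1: Can a reference image generated from text improve fidelity and motion learning in text-to-video diffusion?
RQ2: Does combining reference-guided latent diffusion with flow-based temporal upsampling and a separate video decoder yield higher visual fidelity and temporal consistency than prior T2V methods?
RQ3: How does training the video decoder on unpaired video affect motion realism and overall video quality?
RQ4: What effect does integrating a high-quality reference image into the diffusion conditioning have on standard T2V metrics?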

Key Findings

Table 1: T2V results on UCF-101
Method	Pretrain	Class	Resolution	IS	FVD
Zero-shot setting
CogVideo (Chinese)	Yes	Yes	480 × 480	23.55	751.34
CogVideo (English)	Yes	Yes	480 × 480	25.27	701.59
Make-A-Video	Yes	Yes	256 × 256	33.00	367.23
Ours	Yes	Yes	256 × 256	71.61 ± 0.24	554 ± 23
Fine-tuning setting
TGANv2	No	No	128 × 128	26.60 ± 0.47	-
DIGAN	No	No	-	32.70 ± 0.35	577 ± 22
MoCoGAN-HD	No	No	256 × 256	33.95 ± 0.25	700 ± 24
CogVideo	Yes	Yes	160 × 160	50.46	626
VDM	No	No	64 × 64	57.80 ± 1.3	-
LVDM	No	No	256 × 256	-	372 ± 11
TATS-base	Yes	Yes	128 × 128	79.28 ± 0.38	278 ± 11
Make-A-Video	Yes	Yes	256 × 256	82.55	81.25
Ours	Yes	Yes	256 × 256	82.78 ± 0.34	345 ± 15

Table 2: T2V results on MSR-VTT
Method			Resolution	CLIPSIM	
GODIVA	No	Yes	128 × 128	0.2402	-
Nüwa	No	336 × 336	0.2439	-
CogVideo (Chinese)	Yes	Yes	480 × 480	0.2614	-
CogVideo (English)	Yes	Yes	480 × 480	0.2631	-
Make-A-Video	Yes	Yes	256 × 256	0.3049	-
Ours	Yes	Yes	256 × 256	0.3127	-

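VideoGen reports state-of-the-art results on UCF-101 and MSR-VTT in both qualitative and quantitative evaluation.
In the zero-shot UCF-101 setting, VideoGen reaches an IS of 71.61 ± 0.24, well above the zero-shot baselines in Table 1, which range from 23.55 to 33.00.
On MSR-VTT, it achieves the highest average CLIPSIM score (0.3127) in the zero-shot setting.
In the ablation, removing the reference image lowers CLIPSIM to 0.2534 and IS to 26.64 ± 0.47, while including the T2I reference image improves both metrics.
Flow-based temporal upsampling improves frame continuity and stability compared with interpolation without flow guidance.
The video decoder trained on unpaired video produces sharper textures and better temporal smoothness than the baseline.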
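
For readers unfamiliar with the two metrics cited above, the following is a minimal sketch of how CLIPSIM and the Inception Score (IS) are commonly defined. It is not the paper's evaluation code: the function names are hypothetical, and the inputs are assumed to be precomputed (text and frame embeddings from a CLIP-style encoder for CLIPSIM, per-sample class probabilities from a pretrained classifier for IS). The review does not describe the exact evaluation protocol.

```python
import numpy as np


def clipsim(text_emb, frame_embs):
    """Mean cosine similarity between one text embedding and the per-frame embeddings.

    text_emb:   shape (d,)    assumed to come from a CLIP-style text encoder
    frame_embs: shape (n, d)  assumed to come from the matching image encoder
    """
    text_emb = np.asarray(text_emb, dtype=float)
    frame_embs = np.asarray(frame_embs, dtype=float)
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(np.mean(f @ t))


def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: shape (n, k), one row of class probabilities per generated sample.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Under these definitions, IS ranges from 1 (every sample gets the same class distribution) up to the number of classes (confident predictions spread evenly over all classes), and CLIPSIM lies between -1 and 1. The CLIPSIM drop from 0.3127 to 0.2534 in the ablation is therefore a drop in the average text-frame cosine similarity.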

This review was generated by AI and checked by a human editor.