QUICK REVIEW

[Paper Review] Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Levon Khachatryan, Andranik Movsisyan | arXiv (Cornell University) | Mar 23, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 7

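In a nutshell

Text2Video-Zero generates temporally consistent videos from text prompts without any training. It modifies a pre-trained text-to-image diffusion model to incorporate motion dynamics and cross-frame attention.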

ABSTRACT

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .

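Motivation and objectives

Introduce zero-shot text-to-video generation as a training-free task.
Leverage pre-trained text-to-image diffusion models to synthesize video sequences.
Enforce temporal consistency by adding motion dynamics to the latent codes and by applying cross-frame attention.
Demonstrate broad applicability to conditional and content-specialized video generation and to video editing.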

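Proposed method

Enrich the frames' latent codes with motion dynamics so that the global scene and background stay consistent over time.
Apply cross-frame attention in which every frame attends to the first frame, preserving the identity of the foreground object.
Warp the latent codes from frame to frame with a motion field, then run forward diffusion again to preserve freedom of motion.
Replace the self-attention layers of Stable Diffusion with cross-frame attention to keep the frames consistent with one another.
Optionally apply background smoothing: a convex combination, guided by a foreground mask, of each frame's latent code and the warped first-frame latent code.
Demonstrate compatibility with conditional and content-specialized generation and with instruction-guided editing via Video Instruct-Pix2Pix; the method can be combined with ControlNet and DreamBooth models.
Apply DDIM sampling to the modified latent codes to generate the video sequence.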
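
The latent-level steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, np.roll (which wraps around at the borders) stands in for a real warping operator, the motion field is reduced to a single global translation, the blending weight alpha is an illustrative value, and the denoising network and DDIM sampler are omitted entirely.

```python
import numpy as np

def warp_latent(latent, dx, dy):
    """Shift a (C, H, W) latent by an integer global translation (dx, dy).
    Assumption: np.roll wraps around; a real warp would resample instead."""
    return np.roll(latent, shift=(dy, dx), axis=(1, 2))

def motion_latents(first_latent, num_frames, direction=(1, 0), strength=1):
    """Build one latent per frame: frame k is the first frame's latent
    translated by k * strength * direction (frame 0 is unchanged)."""
    dx, dy = direction
    return np.stack([
        warp_latent(first_latent, k * strength * dx, k * strength * dy)
        for k in range(num_frames)
    ])

def forward_diffuse(latent, alpha_ratio, rng):
    """Re-noise a partially denoised latent (forward diffusion step).
    alpha_ratio is assumed to be alpha_bar_T / alpha_bar_T' for the two
    timesteps involved; alpha_ratio = 1 adds no noise."""
    noise = rng.standard_normal(latent.shape)
    return np.sqrt(alpha_ratio) * latent + np.sqrt(1.0 - alpha_ratio) * noise

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). Each frame's queries attend to the
    keys and values of frame 0 rather than to its own keys and values."""
    k0, v0 = k[0], v[0]                         # (tokens, dim)
    scores = q @ k0.T / np.sqrt(q.shape[-1])    # (frames, tokens, tokens)
    return softmax(scores) @ v0                 # (frames, tokens, dim)

def smooth_background(latent_k, warped_first, fg_mask, alpha=0.6):
    """Keep the foreground of frame k; in the background, take a convex
    combination of the warped first-frame latent and frame k's latent.
    fg_mask: (H, W), 1 = foreground. alpha is an illustrative weight."""
    blended = alpha * warped_first + (1.0 - alpha) * latent_k
    return fg_mask * latent_k + (1.0 - fg_mask) * blended

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 8, 8))         # toy (C, H, W) latent
    frames = motion_latents(x0, num_frames=3, strength=2)
    frames = np.stack([forward_diffuse(f, 0.9, rng) for f in frames])
    print(frames.shape)
```

Because the keys and values always come from frame 0, changing them for any later frame has no effect on the attention output, which is what ties the appearance of every frame to the first one in this sketch.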

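Experimental results

Research questions

RQ1: Is zero-shot text-to-video generation feasible without training or fine-tuning on video data?
RQ2: Do motion-aware latent codes and cross-frame attention improve temporal consistency and the preservation of foreground identity in generated videos?
RQ3: Can zero-shot video generation be extended to conditional, content-specialized, and instruction-guided editing scenarios without additional training?
RQ4: How does the proposed approach compare with existing text-to-video methods in prompt alignment and temporal stability?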

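Key findings

The method generates temporally consistent videos from text prompts without any training.
Motion dynamics in the latent codes improve the temporal consistency of the global scene and background.
Cross-frame attention preserves the appearance and identity of the foreground object across frames.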
On CLIP-based text-video alignment, the method is competitive with CogVideo (a CLIP score of 31.19 for this method vs. 29.63 for CogVideo).
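It enables conditional and content-specialized generation as well as Video Instruct-Pix2Pix without any retraining.
Qualitative results show strong text-video alignment and temporal consistency across a variety of prompts and guidance signals.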
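
A CLIP-based alignment score of the kind reported above can be sketched as follows. This is a hedged illustration only: the review does not state the exact evaluation protocol, so the function name, the per-frame averaging, and the scaling by 100 are assumptions based on a common convention, and the embeddings are assumed to come from a CLIP text encoder and image encoder that are not shown here.

```python
import numpy as np

def clip_score(text_emb, frame_embs):
    """Mean cosine similarity (scaled by 100) between one text embedding
    of shape (dim,) and per-frame image embeddings of shape (frames, dim).
    Assumption: embeddings were produced by a CLIP model elsewhere."""
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(100.0 * np.mean(f @ t))

if __name__ == "__main__":
    # Toy vectors, not real CLIP embeddings.
    text = np.array([1.0, 0.0])
    frames = np.array([[2.0, 0.0], [0.0, 3.0]])
    print(clip_score(text, frames))
```

Under this convention a higher value means the frames sit closer to the prompt in CLIP embedding space; the score says nothing by itself about temporal consistency.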


This review was generated by AI and checked by a human editor.