QUICK REVIEW

[Paper Review] Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Jie An, Songyang Zhang|arXiv (Cornell University)|Apr 17, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 27

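One-Line Summary

Latent-Shift fine-tunes a pretrained text-to-image latent diffusion model with a parameter-free temporal shift module, enabling efficient text-to-video generation. Because it adds no extra temporal modules, it uses fewer parameters while still achieving competitive results.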

ABSTRACT

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation.

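Motivation and Objectives

Extend a text-to-image latent diffusion model (LDM) to text-to-video (T2V) generation without adding new temporal parameters.
Introduce a parameter-free temporal shift module to model temporal dynamics in the U-Net.
Show that the fine-tuned Latent-Shift can still perform text-to-image generation.
Demonstrate efficiency and effectiveness on MSR-VTT and UCF-101, including a user study.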

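Proposed Method

Use a pretrained autoencoder to encode frames into a latent space, together with a U-Net diffusion model trained for T2I in that latent space.
Insert the temporal shift module into the residual branch of each 2D ResNet block so that motion can be learned without adding new parameters.
Along the temporal dimension, shift one third of the channels forward and one third backward, zero-padding the ends of the sequence and leaving the rest in place.
Apply cross-attention in the transformer blocks for text conditioning, and use classifier-free guidance at sampling time.
Train with a latent video diffusion objective over the encoded frame sequence, using a text-conditioned denoising loss similar to that of LDMs.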
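
The method steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (T, C, H, W) layout, and the `spatial_branch` callable standing in for the 2D convolution layers are all assumptions made for illustration. The shift itself follows the description above (one third of the channels forward, one third backward, zeros at the sequence ends), and the guidance and loss functions use the standard classifier-free guidance and noise-prediction formulas.

```python
import numpy as np

def temporal_shift(x, fold_div=3):
    """Shift channel groups along the time axis; no learnable weights involved.

    x: array of shape (T, C, H, W), the feature maps of T frames of one video.
    """
    fold = x.shape[1] // fold_div      # size of each shifted channel group
    out = np.zeros_like(x)             # zeros act as padding at the sequence ends
    # First group: frame i receives the features of frame i-1 (forward shift).
    out[1:, :fold] = x[:-1, :fold]
    # Second group: frame i receives the features of frame i+1 (backward shift).
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    # Remaining channels stay with their own frame.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out

def residual_block_with_shift(x, spatial_branch):
    """Apply the shift only on the residual branch; the skip path is untouched.

    spatial_branch is a hypothetical stand-in for the block's 2D conv layers.
    """
    return x + spatial_branch(temporal_shift(x))

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and text-conditioned noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def denoising_loss(eps_pred, eps_true):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))
```

With an identity `spatial_branch`, the first frame's forward-shifted channels and the last frame's backward-shifted channels contribute only zeros on the residual branch, which is the zero-padding behavior described above.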

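Experimental Results

Research Questions

RQ1: Can a parameter-free temporal shift module provide effective temporal modeling for T2V without adding temporal convolution or attention layers?
RQ2: How does Latent-Shift compare with temporal-attention-based latent video diffusion models in terms of quality and efficiency?
RQ3: Does fine-tuning a T2I model with temporal shift enable T2V generation while preserving T2I capability?
RQ4: What are the trade-offs among model size, speed, and video quality on standard benchmarks?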

Key Findings

Method	Zero-shot	FID ↓	CLIPSIM ↑
CogVideo	Yes	23.59	0.2631
Make-A-Video	Yes	13.17	0.3049
Latent-VDM	Yes	14.25	0.2756
Latent-Shift (ours)	Yes	15.23	0.2773
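
For readers unfamiliar with the CLIPSIM column: it is commonly computed as the average CLIP similarity between the text prompt and each generated frame. The sketch below assumes the CLIP embeddings have already been extracted; the function name and the toy vectors in the usage note are hypothetical, and the exact evaluation protocol used in the paper is not reproduced here.

```python
import numpy as np

def clipsim(text_emb, frame_embs):
    """Average cosine similarity between one text embedding and per-frame embeddings.

    text_emb:   array of shape (D,)
    frame_embs: array of shape (T, D), one embedding per generated frame
    """
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(np.mean(frames @ text))
```

As a toy check, a prompt embedding `[1, 0]` compared against frame embeddings `[[1, 0], [0, 1]]` gives similarities 1 and 0, so the score is 0.5.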

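Latent-Shift achieves zero-shot T2V results on MSR-VTT that are competitive with existing methods.
It runs faster and uses fewer parameters than a temporal-attention-based latent video model while maintaining quality.
On UCF-101, Latent-Shift achieves a high IS and a competitive FVD, approaching state-of-the-art numbers.
In the user study, Latent-Shift is preferred over CogVideo in terms of quality and text faithfulness.
Latent-Shift retains its T2I generation ability after T2V fine-tuning.
The parameter-free temporal shift makes it possible to learn motion without additional temporal modules.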
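
To make the parameter argument concrete, the following back-of-the-envelope sketch counts the weights that a single added temporal layer would carry at a given channel width, compared with the shift operation, which carries none. The layer shapes (a kernel-size-3 1D convolution and a single-head attention layer with four square projections) and the channel widths are illustrative assumptions, not the configurations of the paper or of any baseline.

```python
def temporal_conv1d_params(channels, kernel_size=3):
    # Weights: out_channels * in_channels * kernel_size, plus one bias per output channel.
    return channels * channels * kernel_size + channels

def temporal_attention_params(channels):
    # Query, key, value, and output projections, each channels x channels with a bias.
    return 4 * (channels * channels + channels)

def temporal_shift_params(channels):
    # Shifting only moves existing features along time; nothing is learned.
    return 0

# Hypothetical channel widths, chosen only for illustration.
for c in (320, 640, 1280):
    print(c, temporal_conv1d_params(c), temporal_attention_params(c), temporal_shift_params(c))
```

Because such layers would be added at many places in a U-Net, the per-layer counts accumulate, whereas the shift adds nothing regardless of how often it is applied.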

This review was generated by AI and checked by a human editor.