QUICK REVIEW

[Paper Review] Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Jiaxi Gu, Shicong Wang|arXiv (Cornell University)|Sep 7, 2023

Generative Adversarial Networks and Image Synthesis | Citations: 9

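One-Sentence Summary

VidRD introduces an iterative, single-LDM framework that reuses and diffuses latent features to generate longer, temporally consistent text-to-video clips. By combining fine-tuning of a temporally aware decoder with strategies for composing diverse data, it reduces training complexity relative to cascaded approaches while achieving competitive FVD and IS on UCF-101.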

ABSTRACT

Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDM for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames, which suffer from additional training cost and frame-level jittering, however. In this paper, we propose a framework called "Reuse and Diffuse" dubbed $\textit{VidRD}$ to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available $\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.

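Motivation and Objectives

Tackle text-to-video synthesis with Latent Diffusion Models (LDMs) under computational and memory constraints.
Develop a unified, iterative framework that reuses latent features from an initial clip to generate long, consistent videos.
Improve temporal consistency through the decoder and a temporally aware U-Net.
Propose data composition strategies that draw on image-text and action-recognition video datasets to enable robust training.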

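Proposed Method

Build on a pretrained Stable Diffusion LDM, extending its U-Net with temporal layers (Temp-Conv and Temp-Attn).
Inject temporal layers into the autoencoder's decoder and fine-tune only the added temporal components.
Introduce three modules for iterative generation: Frame-level Noise Reversion (FNR), Past-dependent Noise Sampling (PNS), and Denoising with Staged Guidance (DSG).
Reuse the initial noise in reverse order across clips (FNR); inject new random noise into the later frames (PNS); apply staged guidance to maintain consistency while allowing new content (DSG).
Compose video-text data by converting image-text data into pseudo-videos, annotating short videos with BLIP-2, and segmenting and processing long videos with CLIP and MiniGPT-4.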
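
To make the FNR and PNS steps concrete, here is a minimal NumPy sketch of how the initial noise for the next clip might be derived from the previous clip's noise. It is an illustration under stated assumptions, not the authors' implementation: the function name `next_clip_noise`, the parameters `num_prompt_frames` and `alpha`, the tensor shape, and the variance-preserving mixing rule are all hypothetical. DSG is not covered, since it acts during denoising rather than at noise initialization.

```python
import numpy as np

def next_clip_noise(prev_noise, num_prompt_frames, alpha, rng):
    """Derive initial noise for the next clip from the previous clip's noise.

    prev_noise: array of shape (frames, ...) with per-frame initial noise.
    num_prompt_frames: hypothetical number of leading frames that keep the
        reused noise unchanged.
    alpha: hypothetical weight of fresh noise (0 means pure reuse).
    """
    # FNR: reuse the previous clip's noise with the frame order reversed.
    reused = prev_noise[::-1].copy()
    # PNS: draw fresh Gaussian noise and blend it into the later frames only.
    fresh = rng.standard_normal(prev_noise.shape)
    # Assumed mixing rule: dividing by sqrt(1 + alpha^2) keeps unit variance.
    mixed = (reused + alpha * fresh) / np.sqrt(1.0 + alpha**2)
    out = reused
    out[num_prompt_frames:] = mixed[num_prompt_frames:]
    return out

rng = np.random.default_rng(0)
clip0_noise = rng.standard_normal((8, 4, 16, 16))  # (frames, channels, H, W)
clip1_noise = next_clip_noise(clip0_noise, num_prompt_frames=2, alpha=0.5, rng=rng)
```

In this sketch the first `num_prompt_frames` frames of the new clip start from exactly the reversed noise of the previous clip, which ties the clips together, while the remaining frames receive some fresh randomness so that new content can appear.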

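Experimental Results

Research Questions

RQ1: Can a single diffusion model generate long, temporally consistent videos without training separate prediction components?
RQ2: What effect do the mechanisms (FNR, PNS, DSG) have on improving temporal consistency and reducing content cycling between video clips?
RQ3: Can a diverse, realistically captioned multi-source dataset effectively train an LDM for video generation?
RQ4: How does VidRD perform on standard video generation benchmarks in terms of FVD and IS?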

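Key Findings

VidRD achieves competitive quantitative results on UCF-101: a Fréchet Video Distance (FVD) of 363.19 and an Inception Score (IS) of 39.37.
Temporal modules and iterative generation yield long, smooth videos without multiple cascaded models.
A unified training approach using image-text and action-recognition datasets yields robust video-text alignment.
Together, the three components (Frame-level Noise Reversion, Past-dependent Noise Sampling, and Denoising with Staged Guidance) improve temporal consistency across clips.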
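
For readers unfamiliar with the FVD metric reported above, the underlying quantity is a Fréchet distance between Gaussian fits of two feature sets (lower is better). The sketch below computes that distance with NumPy only. It is illustrative: the function name `frechet_distance` is hypothetical, and random arrays stand in for the video features that a real FVD evaluation would extract with a pretrained video network, so it does not reproduce the paper's evaluation pipeline.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (samples, dims) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.standard_normal((200, 4))  # stand-in for real-video features
shifted_feats = real_feats + 3.0            # same covariance, shifted mean

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, shifted_feats)
```

Identical feature sets give a distance of (numerically) zero, and a pure mean shift contributes only the squared distance between the means.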


This review was generated by AI and checked by a human editor.