QUICK REVIEW

[Paper Review] VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Han Lin, Abhay Zala|arXiv (Cornell University)|Sep 26, 2023

Multimodal Machine Learning Applications|Citations: 8

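One-Line Summary

VideoDirectorGPT first uses an LLM-based planning stage to create a multi-scene video plan, then uses Layout2Vid, a layout-guided video generator, to produce a temporally consistent long video from a single prompt. Training stays efficient because only a small subset of the parameters is updated.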

ABSTRACT

Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules. This prompts an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which includes the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities. Next, guided by this video plan, our video generator, named Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities across multiple scenes, while being trained only with image-level annotations. Our experiments demonstrate that our proposed VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with consistency, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. Detailed ablation studies, including dynamic adjustment of layout control strength with an LLM and video generation with user-provided images, confirm the effectiveness of each component of our framework and its future potential.
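
To make the "video plan" described in the abstract more concrete, here is a minimal Python sketch of one way its four parts (scene descriptions, entities with layouts, backgrounds, and consistency groupings) could be represented. The class and field names (`VideoPlan`, `Scene`, `EntityLayout`, `consistency_groups`), the normalized `(x0, y0, x1, y1)` box convention, and the example values are all illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema: all names and the box convention are assumptions.
Box = tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


@dataclass
class EntityLayout:
    name: str          # entity label, e.g. "mouse"
    boxes: list[Box]   # one bounding box per planned keyframe


@dataclass
class Scene:
    description: str
    background: str
    entities: list[EntityLayout] = field(default_factory=list)


@dataclass
class VideoPlan:
    scenes: list[Scene]
    # Maps an entity name to the indices of the scenes in which
    # that entity should keep the same appearance.
    consistency_groups: dict[str, list[int]] = field(default_factory=dict)

    def scenes_sharing(self, entity: str) -> list[int]:
        """Return the scene indices in which `entity` must stay consistent."""
        return self.consistency_groups.get(entity, [])


# Made-up two-scene plan, for illustration only.
plan = VideoPlan(
    scenes=[
        Scene("a mouse holds a book", "library",
              [EntityLayout("mouse", [(0.1, 0.4, 0.4, 0.9)])]),
        Scene("the mouse reads the book", "library",
              [EntityLayout("mouse", [(0.3, 0.4, 0.6, 0.9)])]),
    ],
    consistency_groups={"mouse": [0, 1]},
)
print(plan.scenes_sharing("mouse"))
```

Keeping the consistency groupings separate from the per-scene entity lists mirrors the abstract's wording, where groupings are listed as their own part of the plan.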

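Motivation and Objectives

Use LLMs to plan multi-scene video content from a single text prompt.
Enable explicit spatial layout control and temporal consistency across scenes in T2V generation.
Train a layout-guided video generator efficiently using only image-level annotations.
Improve layout accuracy and movement control while maintaining open-domain quality.
Open a path toward dynamic layout guidance strength and the integration of user-provided images.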

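Proposed Method

Two-stage pipeline: (i) video planning with GPT-4, which generates scene descriptions, entities with 2D layouts, backgrounds, and consistency groupings; (ii) grounded video generation with Layout2Vid, guided by the plan.
The video plan has four components: multi-scene descriptions, entities with 2D bounding boxes, backgrounds, and cross-scene consistency groupings.
Layout2Vid builds on ModelScopeT2V with most parameters frozen; only the Guided 2D Attention parameters (13% of the total) are trained, which makes it possible to learn layout control from image-level annotations.
Entity grounding uses joint image+text embeddings to preserve identity across scenes: CLIP image and text features are combined with bounding-box Fourier features.
Denoising proceeds in two stages: initial layout-guided steps that use Guided 2D Attention, followed by standard steps. The parameter alpha controls the fraction of steps that are layout-guided.
Training efficiency: Layout2Vid is trained on 0.64M image-level layout annotations and optimized for 50k steps on 8 A6000 GPUs.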
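
The two-stage denoising schedule in the list above can be sketched as follows. This is a minimal illustration, assuming alpha is a fraction in [0, 1] of the total number of denoising steps and that the layout-guided steps come first. The function names `split_denoising_steps` and `step_modes`, the rounding rule, and the example values are assumptions made here, not taken from the paper's code.

```python
def split_denoising_steps(num_steps: int, alpha: float) -> tuple[int, int]:
    """Split a denoising run into (layout_guided_steps, standard_steps).

    Assumption: alpha is the fraction of steps that use layout guidance.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    guided = round(alpha * num_steps)  # rounding rule is an assumption
    return guided, num_steps - guided


def step_modes(num_steps: int, alpha: float) -> list[str]:
    """Label each step: layout-guided steps first, then standard steps."""
    guided, standard = split_denoising_steps(num_steps, alpha)
    return ["layout_guided"] * guided + ["standard"] * standard


# Illustrative values only: 50 steps with alpha = 0.1.
guided, standard = split_denoising_steps(50, 0.1)
print(guided, standard)  # 5 45
```

Under this reading, a larger alpha keeps layout guidance active for more of the run, and alpha = 0 reduces to purely standard denoising.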

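Experimental Results

Research Questions

RQ1: Does an LLM-generated video plan improve consistency and control across multiple scenes in long video generation?
RQ2: Can a layout-guided video generator trained on image-level annotations achieve temporal consistency across scenes while maintaining visual quality?
RQ3: How does dynamically controlling the layout guidance strength affect video quality and layout fidelity?
RQ4: Can user-provided example images be incorporated into layout-grounded video generation?
RQ5: How much do joint image+text embeddings contribute to preserving entity identity?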

Key Findings

Model	Object	Count	Spatial	Scale	Overall Acc.%	Movement Direction Acc.%
ModelScopeT2V	89.8	38.8	18.0	15.8	40.8	30.5
VideoDirectorGPT (Ours)	97.1	77.4	61.1	47.0	70.6	46.5
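
As a reading aid, the gaps between the two rows can be computed directly from the table. The snippet below is a small sketch that hard-codes the reported numbers; the variable names are chosen here for illustration, and the gains are absolute percentage-point differences, not relative improvements.

```python
# Values copied from the table above (accuracy in %).
metrics = ["Object", "Count", "Spatial", "Scale",
           "Overall Acc.", "Movement Direction Acc."]
modelscope = [89.8, 38.8, 18.0, 15.8, 40.8, 30.5]
ours = [97.1, 77.4, 61.1, 47.0, 70.6, 46.5]

# Absolute gain in percentage points for each metric.
gains = {
    name: round(new - base, 1)
    for name, base, new in zip(metrics, modelscope, ours)
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")

# Metric with the largest absolute gain.
largest = max(gains, key=gains.get)
print("Largest gain:", largest)
```

Read this way, the largest gains fall on the spatial, count, and scale skills, where the baseline is weakest, while object accuracy was already high for both models.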

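VideoDirectorGPT achieves better layout control than a strong single-scene baseline (ModelScopeT2V) in object count, spatial relations, and scale, raising overall accuracy from 40.8% to 70.6%.
The framework also improves movement direction accuracy from 30.5% to 46.5%, showing that the LLM plan guides temporal dynamics.
On the open-domain MSR-VTT benchmark, VideoDirectorGPT maintains competitive visual quality and text-video alignment even with the added layout and multi-scene consistency capabilities.
Layout2Vid trains efficiently on image-level layout annotations (only 13% of the parameters are updated) while maintaining video generation quality.
Grounding entities with both image and text embeddings yields higher temporal consistency than text-only grounding.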

This review was generated by AI and checked by a human editor.