QUICK REVIEW

[Paper Review] VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Fuchen Long, Zhaofan Qiu|arXiv (Cornell University)|Jan 2, 2024

Video Analysis and Summarization|Cited by 6

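One-Sentence Summary

VideoStudio uses an LLM-generated multi-scene script to guide diffusion-based video generation. It creates content-consistent multi-scene videos from entity reference images and two diffusion models, one for scene reference images and one for video clips, and outperforms SOTA baselines.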

ABSTRACT

The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between while preserving the consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script that benefits from the logical knowledge learnt by LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as camera movement. VideoStudio identifies the common entities throughout the script and asks LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event and camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference. Source code is available at https://github.com/FuchenUSTC/VideoStudio.

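Motivation and Objectives

Use an LLM to convert the input prompt into a structured multi-scene video script that captures the logic between scenes.
Identify the entities shared across scenes and use them to keep visual appearance consistent.
Generate a reference image for each entity to link the scenes and guide video generation.
Generate each scene video with a diffusion model conditioned on the prompt, the reference images, and the camera movement.
Demonstrate better visual quality and content consistency than state-of-the-art methods.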

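Proposed Method

Three-stage framework: (1) Multi-scene script generation with an LLM (ChatGLM3-6B), which decomposes the input prompt into per-scene event prompts, foreground/background entities, and camera movement.
(2) Reference images for the common entities are generated with Stable Diffusion and refined with U2-Net segmentation, which separates foreground from background, to produce the entity reference images.
(3) Video scene generation with two diffusion branches: VideoStudio-Img creates a scene reference image conditioned on the event prompt and the entity references, and VideoStudio-Vid creates the clip conditioned on the scene reference image, the action words, and the camera movement, using temporal attention and frame warping to reflect the camera motion.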
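
The data flow through the three stages can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `Scene` fields, the rule that an entity counts as "common" when it appears in more than one scene, and the `ref/<name>.png` paths are all assumptions made for the example, and the actual LLM, Stable Diffusion, and U2-Net calls are omitted.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Scene:
    """One scene of a multi-scene script (field names are illustrative)."""
    prompt: str            # event description for the scene
    foreground: list[str]  # names of foreground entities
    background: str        # name of the background entity
    camera: str            # camera movement, e.g. "zoom in"


def common_entities(scenes: list[Scene]) -> list[str]:
    """Return entities that appear in more than one scene, sorted by name.

    Assumption: "common" means "shared by at least two scenes".
    """
    counts: Counter[str] = Counter()
    for scene in scenes:
        # Count each entity at most once per scene.
        counts.update(set(scene.foreground) | {scene.background})
    return sorted(name for name, n in counts.items() if n > 1)


def scene_conditions(scene: Scene, reference_images: dict[str, str]) -> dict:
    """Bundle the conditioning inputs for one scene.

    Only entities that have a reference image are attached.
    """
    names = list(scene.foreground) + [scene.background]
    refs = {n: reference_images[n] for n in names if n in reference_images}
    return {"prompt": scene.prompt, "camera": scene.camera, "references": refs}


# Stage 1 output (hand-written here; the LLM call is omitted).
script = [
    Scene("A girl walks her dog in a park", ["girl", "dog"], "park", "pan right"),
    Scene("The girl throws a ball for the dog", ["girl", "dog"], "park", "static"),
    Scene("The girl reads at home", ["girl"], "living room", "zoom in"),
]

# Stage 2: one reference image per shared entity (placeholder paths).
shared = common_entities(script)
reference_images = {name: f"ref/{name}.png" for name in shared}

# Stage 3 input: per-scene conditioning bundles for the diffusion models.
conditions = [scene_conditions(scene, reference_images) for scene in script]
```

In this toy script, `girl`, `dog`, and `park` are shared, so the third scene is conditioned on the girl's reference image but gets no reference for the living room, which appears only once.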

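Experimental Results

Research Questions

RQ1: How can an LLM-generated multi-scene script improve logical coherence between scenes?
RQ2: Can entity-level reference images ensure content consistency across scenes in a multi-scene video?
RQ3: Do diffusion-based scene and video models conditioned on the script and references outperform existing single-scene and multi-scene video generation methods?
RQ4: How does incorporating temporal dynamics and camera motion affect video quality and consistency?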
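
RQ4 concerns camera motion, which the method summary says is reflected through temporal attention and frame warping. The review does not specify the warp itself, so the following is only a toy illustration under stated assumptions: a frame is a 2-D list of pixel values, a rightward camera pan is modeled as shifting the content left by a fixed number of pixels per frame, and uncovered pixels are filled with a constant. The function names are hypothetical.

```python
def shift_frame(frame, dx, fill=0):
    """Shift a 2-D frame (list of rows) by dx pixels.

    Positive dx moves content right, negative dx moves it left.
    Pixels that scroll in from outside the frame are set to `fill`.
    """
    width = len(frame[0])
    pad = min(abs(dx), width)  # never pad more than the frame width
    shifted = []
    for row in frame:
        if dx >= 0:
            shifted.append([fill] * pad + row[: width - pad])
        else:
            shifted.append(row[pad:] + [fill] * pad)
    return shifted


def pan_right_clip(frame, num_frames, step):
    """Emulate a rightward camera pan: content moves left by `step` per frame."""
    return [shift_frame(frame, -t * step) for t in range(num_frames)]


frame = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
clip = pan_right_clip(frame, num_frames=3, step=1)
```

Each successive frame in `clip` shows the same content displaced further left. A real system would have to synthesize the newly exposed region rather than fill it with a constant.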

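Key Findings

VideoStudio achieves better visual quality and content consistency than state-of-the-art models on multiple benchmarks.
Incorporating entity reference images improves cross-scene consistency and alignment with the prompt.
The two-stage diffusion approach (scene reference image generation followed by video generation) effectively preserves consistent entities across scenes.
Human evaluation shows improved visual quality, logical coherence, and content consistency when LLM-driven scripting and reference images are used.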

This review was generated by AI and checked by a human editor.