QUICK REVIEW

[Paper Review] WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Xiaofeng Wang, Zheng Zhu | arXiv (Cornell University) | Jan 18, 2024

Generative Adversarial Networks and Image Synthesis | Citations: 5

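One-Line Summary

WorldDreamer trains a general world model by predicting masked visual tokens within a Transformer framework, enabling text-to-video, image-to-video, video editing, and action-conditioned video generation across diverse scenes.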

ABSTRACT

World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general world dynamic environments. Therefore, we introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer showcases versatility in executing tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.

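Motivation and Objectives

Motivate the need for a general world model that handles diverse real-world dynamics beyond games and robotics.
Propose a token-prediction paradigm for video modeling, inspired by large language models.
Develop the Spatial Temporal Patchwise Transformer (STPT), which learns motion and physics in video efficiently.
Enable multi-modal prompts (text and actions) that direct video generation and editing.
Demonstrate applicability across natural scenes, driving scenarios, and multiple generation and editing tasks.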

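Proposed Method

Encode visual inputs into discrete tokens with VQGAN and model the task as masked-token prediction.
Represent text with T5 embeddings and actions with an MLP to form multi-modal prompts.
Use the Spatial Temporal Patchwise Transformer (STPT) to apply attention within local spatio-temporal patches, plus cross-attention to the multi-modal prompts.
Train with a dynamic masking strategy on a cosine schedule, which enables parallel token prediction and reduces information leakage.
Optimize a cross-entropy loss to predict the masked tokens conditioned on the unmasked tokens and the multi-modal prompts.
Fine-tune all STPT parameters on self-collected data and nuScenes to improve spatio-temporal understanding.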
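
The masking and loss steps can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The exact schedule form (`cos(pi/2 * u)` with `u` drawn uniformly, as is common in masked-token generators), the `MASK_ID` value, and all function names are assumptions made for illustration. The toy logits stand in for the Transformer output.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def cosine_mask_ratio(u: float) -> float:
    """Map u in [0, 1) to a masking ratio in (0, 1] (assumed cosine form)."""
    return math.cos(0.5 * math.pi * u)


def mask_tokens(tokens, ratio, rng):
    """Replace ceil(ratio * len(tokens)) randomly chosen tokens with MASK_ID."""
    n_mask = max(1, math.ceil(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


def masked_cross_entropy(logits, targets, masked_positions):
    """Mean cross-entropy over masked positions only.

    logits[i] is a list of unnormalized scores over the codebook for position i.
    """
    total = 0.0
    for i in masked_positions:
        row = logits[i]
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[targets[i]]
    return total / len(masked_positions)


# Toy usage: 16 visual tokens from a codebook of size 8.
rng = random.Random(0)
tokens = [rng.randrange(8) for _ in range(16)]
ratio = cosine_mask_ratio(rng.random())  # a fresh ratio per training sample
masked, positions = mask_tokens(tokens, ratio, rng)
toy_logits = [[0.0] * 8 for _ in tokens]  # placeholder for model output
loss = masked_cross_entropy(toy_logits, tokens, positions)
```

Because the ratio changes from sample to sample, the model sees everything from nearly empty to nearly complete token grids during training, which is what later allows many tokens to be predicted in one pass.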

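Experimental Results

Research Questions

RQ1: Can a general world model learned from visual tokens predict dynamics and physics across varied real-world scenes?
RQ2: How effectively does STPT capture spatio-temporal dynamics while integrating multi-modal prompts (text and actions)?
RQ3: Can the model support multiple generation and editing tasks, including text-to-video, image-to-video, inpainting, stylization, and action-to-video?
RQ4: Does parallel masked-token prediction offer speed and quality advantages over diffusion and autoregressive approaches?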

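Key Findings

WorldDreamer generates videos for both natural scenes and driving scenarios.
The model supports text-to-video, image-to-video, video editing, and action-to-video generation.
Joint training on image and video data, together with multi-modal prompts, improves spatio-temporal understanding.
Inference uses parallel masked-token prediction and decodes about 3x faster than diffusion-based methods.
Classifier-free guidance (CFG) improves generation quality at inference time.
A single A800 GPU generates 24 frames at 192x320 in 3 seconds.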
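
A rough sketch of how parallel masked-token decoding with CFG can work is given below, in plain Python. It is an illustration under stated assumptions, not the paper's code. The guidance formula (`uncond + scale * (cond - uncond)`), the confidence-based commit rule, the cosine unmasking schedule, and every name here (`cfg_logits`, `parallel_decode`, `toy_predictor`) are hypothetical. The toy predictor stands in for the Transformer.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance in one common form (assumed here)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def parallel_decode(predict_logits, n_tokens, n_steps, mask_id=-1):
    """Start fully masked; each step commits the most confident predictions."""
    tokens = [mask_id] * n_tokens
    for step in range(n_steps):
        masked = [i for i, t in enumerate(tokens) if t == mask_id]
        if not masked:
            break
        logits = predict_logits(tokens)  # one forward pass scores all positions
        picks = {}
        for i in masked:
            probs = softmax(logits[i])
            best = max(range(len(probs)), key=probs.__getitem__)
            picks[i] = (probs[best], best)  # (confidence, token id)
        # Cosine schedule: how many tokens should still be masked after this step.
        frac = math.cos(0.5 * math.pi * (step + 1) / n_steps)
        keep_masked = math.floor(n_tokens * frac)
        n_commit = max(1, len(masked) - keep_masked)
        ranked = sorted(masked, key=lambda i: picks[i][0], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = picks[i][1]
    return tokens


def toy_predictor(tokens, scale=2.0):
    """Hypothetical model: conditional logits favor token i % 4, unconditional are flat."""
    out = []
    for i in range(len(tokens)):
        cond = [3.0 if k == i % 4 else 0.0 for k in range(4)]
        uncond = [0.0] * 4
        out.append(cfg_logits(cond, uncond, scale))
    return out


video_tokens = parallel_decode(toy_predictor, n_tokens=12, n_steps=4)
```

The point of the sketch is the cost structure: the whole token grid is filled in `n_steps` forward passes instead of one pass per token, which is the kind of saving the speed comparison above refers to.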


This review was generated by AI and checked by a human editor.