QUICK REVIEW

[Paper Review] Structure and Content-Guided Video Synthesis with Diffusion Models

Patrick Esser, Johnathan Chiu|arXiv (Cornell University)|Feb 6, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 17

Summary in Brief

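This paper proposes a structure- and content-guided latent video diffusion model that edits videos according to text or image prompts while preserving the structure of the input. It relies on joint image-video training, a depth-based structure representation, and a novel guidance method that controls temporal consistency.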

ABSTRACT

Text-guided generative diffusion models unlock powerful image creation and editing tools. While these have been extended to video generation, current approaches that edit the content of existing footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method. Our experiments demonstrate a wide variety of successes; fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

Motivation and Objectives

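Develop a controllable video diffusion model that edits content while preserving structure.
Enable text- and image-guided video editing without per-video training.
Provide explicit control over temporal consistency and structure fidelity.
Train on depth-based structure representations with varying levels of detail so that structure fidelity can be adjusted.
Demonstrate customization of edits and user preference for the results.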

Proposed Method

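Extend a latent diffusion model to the spatio-temporal domain by adding temporal layers to a pretrained image model.
Represent structure with monocular depth estimates and content with CLIP-based embeddings.
Train jointly on images and videos to enable temporal control at inference time.
During denoising, condition the model on structure s via concatenation and on content c via cross-attention.
Control structure fidelity by using depth maps with different amounts of blurring t_s during training and inference.
Extend classifier-free diffusion guidance with a content guidance scale and a temporal guidance scale to adjust prompt fidelity and temporal consistency.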
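
The two guidance scales in the last step can be illustrated with a small sketch. This review does not give the exact formula, so the code assumes the usual classifier-free extrapolation form applied twice: once from a per-frame (image-mode) prediction toward the video-mode prediction, scaled by omega_t, and once from the unconditional toward the content-conditioned prediction, scaled by omega. The function name and the three prediction arguments are hypothetical and stand in for denoiser outputs flattened to lists of floats.

```python
def guided_prediction(frame_uncond, video_uncond, video_cond, omega_t, omega):
    """Combine three denoiser predictions with temporal and content guidance.

    Assumed form (not taken verbatim from the paper):
        out = frame_uncond
              + omega_t * (video_uncond - frame_uncond)   # temporal guidance
              + omega   * (video_cond   - video_uncond)   # content guidance
    """
    return [
        f + omega_t * (vu - f) + omega * (vc - vu)
        for f, vu, vc in zip(frame_uncond, video_uncond, video_cond)
    ]


# With both scales at 1 the result equals the conditioned video-mode prediction;
# larger omega_t pushes further along the frame-to-video direction.
baseline = guided_prediction([0.0], [1.0], [3.0], omega_t=1.0, omega=1.0)
stronger = guided_prediction([0.0], [1.0], [3.0], omega_t=2.0, omega=1.0)
```

Under this assumed form, setting omega_t to 0 removes the temporal term entirely, which is one way to read the "explicit control of temporal consistency" that joint image-video training is said to enable.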

Experimental Results

Research Questions

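RQ1: How can a diffusion model edit the content of a video while preserving the original structure of the input video?
RQ2: Can joint image and video training provide explicit control of temporal consistency at inference time?
RQ3: How can depth-based structure representations and CLIP-based content representations be used effectively to condition a video diffusion model?
RQ4: To what extent can sampling guidance and the level of structural detail control edit fidelity and temporal smoothness?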

Key Findings

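The model offers fine-grained control over temporal consistency, structure fidelity, and content edits at inference time.
Joint training on image and video data improves temporal consistency compared with image-only approaches.
Depth-based structure representations with varying levels of detail (t_s) control how much structure is preserved during editing.
Content can be specified through a text prompt or a CLIP embedding, using a learned prior that converts text into image embeddings.
A novel temporal guidance mechanism (ω_t) applied during sampling improves frame-to-frame consistency while maintaining prompt adherence.
A user study shows that this approach is preferred over several baselines for text- and image-guided video editing.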
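
The structure-detail control (t_s) mentioned in the findings can be sketched as repeated blurring of a depth map. The review only says that the depth maps are blurred by varying amounts, so the 3x3 box filter below is an assumption standing in for whatever operator the authors actually use, and both function names are hypothetical.

```python
def box_blur(depth):
    """Apply one 3x3 box blur to a 2D depth map (list of rows), clamping at edges."""
    h, w = len(depth), len(depth[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average over the neighbors that fall inside the map.
            vals = [
                depth[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def degrade_structure(depth, t_s):
    """Blur the depth map t_s times; larger t_s keeps less structural detail."""
    for _ in range(t_s):
        depth = box_blur(depth)
    return depth


# A single sharp depth spike is progressively flattened as t_s grows.
spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
coarse = degrade_structure(spike, t_s=1)
```

With t_s = 0 the depth map is passed through unchanged, which corresponds to the highest structure fidelity; higher values leave the model more freedom to follow the content prompt.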


This review was generated by AI and checked by a human editor.