QUICK REVIEW

[Paper Review] TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer, Omer Bar-Tal|arXiv (Cornell University)|Jul 19, 2023

Generative Adversarial Networks and Image Synthesis|Citations: 40

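One-Sentence Summary

TokenFlow enforces temporal consistency in video editing by propagating diffusion features across frames using inter-frame correspondences, which enables high-quality text-driven video editing without any training. It achieves state-of-the-art temporal coherence relative to per-frame editing baselines.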

ABSTRACT

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/

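Motivation and Objectives

Improve video editing quality and temporal consistency using a pre-trained image diffusion model.
Leverage the diffusion feature space to enforce inter-frame consistency during editing.
Provide a training-free framework that is compatible with off-the-shelf image editing methods.
Demonstrate state-of-the-art temporal coherence on a variety of real-world videos.
Analyze the properties of diffusion features and how they relate to the redundancy of video.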

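Proposed Method

Extract diffusion tokens, across layers, from the DDIM-inverted video frames.
Sample a set of keyframes and edit them jointly, using extended attention across multiple frames to induce a shared global appearance.
Propagate the edited tokens to non-keyframes using nearest-neighbor correspondences computed in the original diffusion feature space.
Combine keyframe editing with TokenFlow propagation at every denoising step to maintain inter-frame consistency.
Support propagation with any diffusion-based image editing method (e.g., PnP, Meng et al., Zhang & Agrawala).
Evaluate temporal consistency with warp error and a user study, and edit fidelity with CLIP similarity.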
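
The propagation step in the list above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single keyframe, cosine similarity as the nearest-neighbor metric, and tokens stored as plain Python lists, and the function names (`nearest_neighbor_indices`, `propagate_tokens`) are hypothetical. A practical version would operate on batched tensors from the model's self-attention layers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon guards against zero vectors


def nearest_neighbor_indices(src_frame, src_keyframe):
    """For each source token of a frame, return the index of the most similar
    source token in the keyframe (the cosine metric is an assumption)."""
    return [
        max(range(len(src_keyframe)),
            key=lambda j: cosine_similarity(token, src_keyframe[j]))
        for token in src_frame
    ]


def propagate_tokens(src_frame, src_keyframe, edited_keyframe):
    """Build tokens for a non-keyframe by copying the edited keyframe token at
    each correspondence found in the ORIGINAL (unedited) feature space."""
    correspondences = nearest_neighbor_indices(src_frame, src_keyframe)
    return [list(edited_keyframe[j]) for j in correspondences]


# Toy data: three keyframe tokens, two tokens in a neighboring frame.
src_keyframe = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # unedited keyframe features
edited_keyframe = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # same positions after editing
src_frame = [[0.1, 0.9], [0.9, 0.1]]                      # unedited non-keyframe features

print(nearest_neighbor_indices(src_frame, src_keyframe))           # [1, 0]
print(propagate_tokens(src_frame, src_keyframe, edited_keyframe))  # [[0.0, 10.0], [10.0, 0.0]]
```

The correspondences are computed once on the source features and then reused to index the edited features, which is why the edit follows the original motion rather than being re-estimated per frame.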

Figure 2: Fine-grained feature correspondences. Features (i.e., output tokens from the self-attention modules) extracted from a source frame are used to reconstruct nearby frames. This is done by: (a) swapping each feature in the target by its nearest feature in the source, in all layers and all

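Experimental Results

Research Questions

RQ1: With a pre-trained image diffusion model, can consistency in the diffusion feature space produce more temporally consistent video edits?
RQ2: Does combining joint keyframe editing with feature-space propagation outperform per-frame editing baselines in preserving structure and motion?
RQ3: How do diffusion features reflect the temporal redundancy of natural videos, and can that redundancy be exploited for better editing?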

Key Findings

Method	Warp-err (×10^-3)	User preference	CLIP
LDM recon.	2.0	-	0.23
PnP-Diffusion	11.3	94%	0.33
Text2Video-Zero	12.5	78%	0.33
Tune-a-Video	30.0	82%	0.31
Fate-Zero	6.9	71%	0.32
Gen1	-	70%	0.32
Rerender-a-Video	1.8	71%	0.32
Ours w joint attention	5.9	90%	0.33
Ours w/o rand keyframes	3.7	-	0.33
Ours	3.0	-	0.33

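TokenFlow (Ours) reaches a warp error of 3.0 ×10^-3, lower than every editing baseline with a reported value except Rerender-a-Video (1.8). The user-preference column has no entry for the full method and appears to report how often raters preferred TokenFlow over each alternative; those figures range from 70% to 94%.
Even the ablated variants, with joint attention (5.9) and without random keyframes (3.7), have lower warp error than the per-frame PnP-Diffusion baseline (11.3).
The method's CLIP score of 0.33 matches the highest in the table, tied with PnP-Diffusion and Text2Video-Zero, which indicates good alignment with the target prompt.
Qualitative results show edits that preserve the original motion and semantic layout across a variety of videos.
The ablation shows that full TokenFlow outperforms extended attention alone, and that random keyframe sampling improves robustness (3.0 versus 3.7 without it).
Overall, the quantitative table shows gains over most baselines in warp error and user preference, with CLIP similarity on par with the best of them.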
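
To make the warp-error comparison concrete, the following sketch recomputes relative reductions directly from the values listed in the table. It adds no new measurements; the helper names (`relative_reduction`, `compare_to`) are hypothetical. LDM recon. is left out because it is a reconstruction reference rather than an edit, and Gen1 because the table lists no warp error for it.

```python
# Warp error (x10^-3) copied from the table; lower is better.
warp_err = {
    "PnP-Diffusion": 11.3,
    "Text2Video-Zero": 12.5,
    "Tune-a-Video": 30.0,
    "Fate-Zero": 6.9,
    "Rerender-a-Video": 1.8,
    "Ours w joint attention": 5.9,
    "Ours w/o rand keyframes": 3.7,
    "Ours": 3.0,
}


def relative_reduction(other, reference):
    """Fraction by which `reference` lowers the error compared with `other`.
    Negative values mean `reference` has the HIGHER error."""
    return (other - reference) / other


def compare_to(reference_name, table):
    """Relative reduction of the reference method against every other row."""
    ref = table[reference_name]
    return {
        name: relative_reduction(value, ref)
        for name, value in table.items()
        if name != reference_name
    }


# Print rows sorted from the largest reduction to the smallest.
for name, r in sorted(compare_to("Ours", warp_err).items(),
                      key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {r:+.1%}")
```

Against PnP-Diffusion the reduction is about 73%, while the value for Rerender-a-Video comes out negative, reflecting its lower warp error in the table.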

Figure 3: Diffusion features across time. Left: Given an input video (top row), we apply DDIM inversion on each frame and extract features from the highest resolution decoder layer in $\epsilon_{\theta}$. We apply PCA on the features (i.e., output tokens from the self-attention module) extracted fr

This review was generated by AI and checked by a human editor.