QUICK REVIEW

[Paper Review] FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Yuren Cong, Mengmeng Xu|arXiv (Cornell University)|Oct 9, 2023

Video Analysis and Summarization|Cited by 10

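One-line summary

FLATTEN introduces flow-guided attention into diffusion-based text-to-video editing. It achieves patch-level consistency along optical-flow trajectories without any training and reaches state-of-the-art results on the TGVE benchmarks.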

ABSTRACT

Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.

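Motivation and objectives

Resolve visual inconsistency in text-to-video editing by exploiting temporal coherence across frames.
Introduce a training-free flow-guided attention mechanism (FLATTEN) that is compatible with pre-trained T2I diffusion models.
Improve cross-frame consistency through optical-flow trajectories while preserving the per-frame feature distribution.
Show that FLATTEN improves editing quality and can be integrated into existing T2V editing methods.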

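Proposed method

Inflate a pre-trained text-to-image diffusion U-Net along the temporal axis to build a T2V editing framework.
Replace dense spatio-temporal attention with flow-guided attention (FLATTEN), which uses patch trajectories guided by optical flow.
Downsample the RAFT-estimated flow to the latent-space resolution to compute patch trajectories, then gather the patches on each trajectory and use them as the Q/K/V of the attention.
Apply FLATTEN during both DDIM inversion and sampling, without any training, to improve temporal consistency.
During sampling, inject diffusion features, following common image-editing practice, to improve per-frame consistency.
No new trainable parameters are introduced; FLATTEN reuses the existing projection layers and attention blocks.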
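
The trajectory and attention steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the flow is assumed to be already downsampled to latent resolution and stored as hypothetical `(dx, dy)` displacements in patch units, positions are rounded to the nearest patch, and the names `build_trajectories`, `project`, and `flow_guided_attention` are invented for this example. The matrices `w_q`, `w_k`, and `w_v` stand in for the projections that already exist in the attention block.

```python
import math


def build_trajectories(flows, height, width):
    """Follow per-patch flow from frame to frame.

    flows[t][y][x] is an assumed (dx, dy) displacement, in latent-patch
    units, from frame t to frame t + 1. Every patch ends up on exactly
    one trajectory, returned as a list of (t, y, x) positions.
    """
    num_frames = len(flows) + 1
    visited = set()
    trajectories = []
    for t0 in range(num_frames):
        for y0 in range(height):
            for x0 in range(width):
                if (t0, y0, x0) in visited:
                    continue  # already claimed by an earlier trajectory
                t, y, x = t0, y0, x0
                visited.add((t, y, x))
                trajectory = [(t, y, x)]
                while t < num_frames - 1:
                    dx, dy = flows[t][y][x]
                    nx, ny = round(x + dx), round(y + dy)
                    inside = 0 <= nx < width and 0 <= ny < height
                    if not inside or (t + 1, ny, nx) in visited:
                        break  # left the frame or hit an occupied patch
                    t, y, x = t + 1, ny, nx
                    visited.add((t, y, x))
                    trajectory.append((t, y, x))
                trajectories.append(trajectory)
    return trajectories


def project(token, weight):
    """Multiply a feature vector by a (C_in x C_out) nested-list matrix."""
    return [sum(a * w for a, w in zip(token, col)) for col in zip(*weight)]


def flow_guided_attention(latents, trajectories, w_q, w_k, w_v):
    """Softmax attention restricted to patches on the same trajectory.

    latents[t][y][x] is a feature vector (list of floats).
    """
    output = [[[None] * len(row) for row in frame] for frame in latents]
    for trajectory in trajectories:
        tokens = [latents[t][y][x] for t, y, x in trajectory]
        queries = [project(tok, w_q) for tok in tokens]
        keys = [project(tok, w_k) for tok in tokens]
        values = [project(tok, w_v) for tok in tokens]
        scale = 1.0 / math.sqrt(len(keys[0]))
        for (t, y, x), q in zip(trajectory, queries):
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            output[t][y][x] = [
                sum(w * v[c] for w, v in zip(weights, values)) / total
                for c in range(len(values[0]))
            ]
    return output


# Usage: 3 frames of a 2x3 latent grid where every patch moves one step right.
flows = [[[(1, 0)] * 3 for _ in range(2)] for _ in range(2)]
trajectories = build_trajectories(flows, height=2, width=3)
```

Because each patch attends only to the patches on its own trajectory, the cost per patch grows with the trajectory length rather than with the full number of spatio-temporal tokens. A practical version would batch this on GPU tensors instead of looping over Python lists.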

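Experimental results

Research questions

RQ1: How can optical-flow guidance improve cross-frame consistency in text-to-video editing without requiring training?
RQ2: Does integrating flow-guided attention into diffusion-based T2V editing improve visual consistency and text fidelity over the baselines?
RQ3: Can FLATTEN be incorporated into other diffusion-based T2V methods to improve their performance?
RQ4: How does applying FLATTEN during DDIM inversion differ in effect from applying it only during sampling?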

Key findings

Method	CLIP-F ↑	PickScore ↑	CLIP-T ↑	E_warp ↓	S_edit ↑
TGVE-D - FLATTEN (ours)	92.49	20.95	28.05	4.92	57.01
TGVE-V - FLATTEN (ours)	96.75	20.63	26.70	3.16	84.49

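FLATTEN achieves new state-of-the-art text alignment and editing quality on the TGVE-D and TGVE-V benchmarks.
On TGVE-D it records CLIP-F 92.49, PickScore 20.95, CLIP-T 28.05, E_warp 4.92, and S_edit 57.01; on TGVE-V it records CLIP-F 96.75, PickScore 20.63, CLIP-T 26.70, E_warp 3.16, and S_edit 84.49.
Among the reported methods, FLATTEN delivers the best or comparable CLIP-T and S_edit scores, reducing the warp error (E_warp) relative to the baselines while maintaining a competitive CLIP-F.
Integrating FLATTEN into ControlVideo improves visual consistency, lowering E_warp from 6.81 to 4.78 and raising S_edit from 40.70 to 56.42.
The ablation shows that combining dense spatio-temporal attention (DSTA) with FLATTEN (Approach II) substantially improves S_edit and reduces temporal inconsistency compared with using either one alone.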
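
The reported numbers are consistent with S_edit being the ratio of CLIP-T to E_warp, scaled by 10. This relation is inferred from the table values rather than stated in this review, so the sketch below treats it as an assumption; the function name `s_edit` and the `scale` parameter are hypothetical.

```python
def s_edit(clip_t, e_warp, scale=10.0):
    """Assumed relation (inferred from the table): scale * CLIP-T / E_warp."""
    return scale * clip_t / e_warp


# (CLIP-T, E_warp, reported S_edit) taken from the results table.
rows = {
    "TGVE-D": (28.05, 4.92, 57.01),
    "TGVE-V": (26.70, 3.16, 84.49),
}

for name, (clip_t, e_warp, reported) in rows.items():
    computed = round(s_edit(clip_t, e_warp), 2)
    print(f"{name}: computed {computed}, reported {reported}")
```

Under this reading, S_edit rewards high text alignment and penalizes warp error at the same time, which is why a lower E_warp raises S_edit in the ControlVideo comparison.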


This review was generated by AI and checked by a human editor.