QUICK REVIEW

[Paper Review] Edit-A-Video: Single Video Editing with Object-Aware Consistency

Chaehun Shin, Heeseung Kim|arXiv (Cornell University)|Mar 14, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 12

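One-Line Summary

Edit-A-Video edits a single video under the guidance of a text prompt: it inflates a 2D text-to-image diffusion model into a 3D model, inverts the source video into noise, and injects attention maps during editing. It also introduces a novel temporal-consistent blending (TC Blending) method, called sparse-causal blending (SC Blending) in the abstract, that keeps the background consistent.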

ABSTRACT

Despite the fact that text-to-video (TTV) model has recently achieved remarkable success, there have been few approaches on TTV for its extension to video editing. Motivated by approaches on TTV models adapting from diffusion-based text-to-image (TTI) models, we suggest the video editing framework given only a pretrained TTI model and a single pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules and tuning on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection. Each stage enables the temporal modeling and preservation of semantic attributes of the source video. One of the key challenges for video editing include a background inconsistency problem, where the regions not included for the edit suffer from undesirable and inconsistent temporal alterations. To mitigate this issue, we also introduce a novel mask blending method, termed as sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect the temporal consistency so that the area where the editing is applied exhibits smooth transition while also achieving spatio-temporal consistency of the unedited regions. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.

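Motivation and Objectives

Enable text-guided video editing using only a pretrained text-to-image model and a single <text, video> pair.
Develop a two-stage framework that inflates the 2D model into a 3D model for temporal modeling, then edits via inversion and attention map injection.
Mitigate background inconsistency with a novel temporal-consistent blending (TC Blending) that preserves unedited regions over time.
Analyze the roles of the different attention modules (Cross-Attention, Temporal Attention, ST-Attn) and examine their effect on temporal consistency and content preservation.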

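Proposed Method

Inflate the pretrained 2D TTI model into a 3D TTV model by adding temporal modules and converting the 2D convolutions and self-attention layers into temporally aware counterparts.
Invert the source video into Gaussian noise with DDIM inversion, and optimize null-text embeddings so that the source can still be reconstructed during editing.
Inject the source attention maps into the generation process driven by the target text, so that the edited content stays aligned with the original spatial layout.
Introduce Temporal-Consistent Blending (TC Blending), which generates blending masks that target the edited region while preserving the background across frames.
Compute sparse spatio-temporal attention (ST-Attn), which relates the features of the current frame to those of the first frame and the previous frame, for mask construction.
Provide an analysis of the roles of Cross-Attention, Temporal Attention, and ST-Attn in maintaining temporal consistency and editing fidelity.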
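The ST-Attn and mask-blending steps above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, and the plain linear blend are assumptions made for illustration, and the review does not say how the blending mask is derived from the attention maps, so the mask is simply taken as an input here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(q, k, v):
    """Hypothetical ST-Attn sketch.

    q, k, v have shape (frames, tokens, dim). Each frame i attends only to
    the tokens of the first frame and of the previous frame (frame 0 uses
    itself as its "previous" frame).
    """
    num_frames, num_tokens, dim = q.shape
    out = np.empty((num_frames, num_tokens, v.shape[-1]))
    for i in range(num_frames):
        prev = max(i - 1, 0)
        keys = np.concatenate([k[0], k[prev]], axis=0)      # (2 * tokens, dim)
        values = np.concatenate([v[0], v[prev]], axis=0)    # (2 * tokens, dim_v)
        weights = softmax(q[i] @ keys.T / np.sqrt(dim))     # (tokens, 2 * tokens)
        out[i] = weights @ values
    return out

def blend_latents(edited, source, mask):
    """Assumed linear blend: edited content where mask is 1, source where it is 0."""
    return mask * edited + (1.0 - mask) * source
```

Restricting the keys to two frames keeps the attention cost linear in the number of frames, while the shared reference to the first frame gives every frame a common anchor.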

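Experimental Results

Research Questions

RQ1: Can a text-to-image diffusion model, inflated into a video model and fine-tuned on a single video, produce temporally consistent edits guided by the target text?
RQ2: Does attention map injection enable faithful editing of the target object while preserving the unedited regions in every frame?
RQ3: Can TC Blending produce sharp, temporally consistent masks for each frame and reduce background inconsistency in the edited video?
RQ4: How do the different attention modules (Cross-Attention, Temporal Attention, ST-Attn) affect editing quality and temporal consistency?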

Key Findings

Method	User Score (O)	Text Alignment	LPIPS	PSNR
Edit-A-Video (Ours)	3.80±0.10	30.2688	0.2625	20.0992
Tune-A-Video	3.46±0.10	30.0514	0.4482	14.5753
SDEdit	3.40±0.10	28.4203	0.2711	20.4767
Video-P2P	3.66±0.10	30.0842	0.3047	17.5760
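For readers unfamiliar with the PSNR column, the standard definition can be sketched as follows. The review does not state which image region or pixel range the paper uses, so the `max_value` default and the whole-array comparison are assumptions rather than the paper's protocol.

```python
import numpy as np

def psnr(reference, edited, max_value=1.0):
    """Standard peak signal-to-noise ratio in dB (higher means closer to the reference)."""
    reference = np.asarray(reference, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    mse = np.mean((reference - edited) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Under this definition, a PSNR of about 20 dB corresponds to a root-mean-square error of roughly 10% of the pixel range.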

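Edit-A-Video achieves higher user preference scores than the baselines for background preservation, text alignment, and video realism.
Quantitatively, Edit-A-Video scores 3.80±0.10 on User Score (O), 30.2688 on Text Alignment, 0.2625 on LPIPS, and 20.0992 on PSNR, ahead of Tune-A-Video, SDEdit, and Video-P2P on every metric except PSNR, where SDEdit is slightly higher (20.4767).
TC Blending improves masking of the target object and preservation of the background, yielding higher User Scores, better LPIPS/PSNR, and higher Mask IoU than the ablated variants.
The ablation study shows that TC Blending produces sharper, temporally consistent masks and reduces background inconsistency.
A Cross-Attention injection duration of 0.2 preserves the spatial layout while still allowing the target semantics to take effect; Temporal Attention (0.8) shows robust temporal modeling; ST-Attn (0.5) balances dynamic motion against edit focus.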
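One way to read the injection durations above is as fractions of the denoising steps, counted from the noisiest step, during which the source attention maps are injected. The review does not define the values, so this reading, the function name, and the dictionary keys in the sketch below are all assumptions.

```python
def injection_schedule(num_steps, durations):
    """For each denoising step, report which attention types are injected.

    durations maps an attention-type name to the assumed fraction of steps
    (from the start of denoising) during which its source maps are injected.
    """
    cutoffs = {name: int(round(frac * num_steps)) for name, frac in durations.items()}
    return [
        {name: step < cutoff for name, cutoff in cutoffs.items()}
        for step in range(num_steps)
    ]

# Hypothetical keys; the values are the durations quoted in the findings.
durations = {"cross": 0.2, "temporal": 0.8, "st_attn": 0.5}
schedule = injection_schedule(10, durations)
```

With 10 steps, this schedule injects Cross-Attention maps for the first 2 steps, ST-Attn maps for the first 5, and Temporal Attention maps for the first 8.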


This review was generated by AI and checked by a human editor.