QUICK REVIEW

[Paper Review] GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models

Sai Sree Harsha, Ambareesh Revanur|arXiv (Cornell University)|Apr 18, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 6

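One-line summary

GenVideo uses a target image and latent correction driven by shape aware InvEdit masks to edit videos with temporal consistency, even when the target's shape differs from the source's.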

ABSTRACT

Video editing methods based on diffusion models that rely solely on a text prompt for the edit are hindered by the limited expressive power of text prompts. Thus, incorporating a reference target image as a visual guide becomes desirable for precise control over edit. Also, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from the source object. To address these challenges, we propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target and shape aware InvEdit masks. Further, we propose a novel target-image aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes, where existing approaches fail.

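Motivation and objectives

Use a target image as a visual guide to enable precise video editing where text alone is insufficient.
Enable edits in which the shape and size of the target object differ from those of the source object.
Maintain temporal consistency across frames during editing.
Provide a mask-driven inference framework that can be adapted to image-conditioned diffusion models.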

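Proposed method

Fine-tune an inflated SD-unCLIP model on the source video so that it accepts both target-image and text conditioning.
Generate target-image and shape aware InvEdit masks by comparing the noise predicted at each DDIM step for the source and for the target.
Inject target-image embeddings into the masked region during UNet inference using a latent fusion scheme.
Apply a latent noise correction strategy during inference to improve temporal consistency across frames.
Use the InvEdit masks to guide latent blending so that the background is either preserved or selectively changed.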
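The mask-guided latent blending in the last step can be illustrated with a minimal sketch. It assumes a single latent channel stored as plain nested lists and a mask convention where 1 marks the region to edit; the function names `blend_latents` and `invert_mask` are hypothetical and are not taken from the paper.

```python
def blend_latents(edited, source, mask):
    """Blend two latent grids with a binary mask.

    edited, source: 2D lists (H x W) of floats, one latent channel each.
    mask: 2D list (H x W) of 0/1. Assumed convention: 1 = take the edited
    latent, 0 = keep the source latent (for example, the background).
    """
    blended = []
    for edited_row, source_row, mask_row in zip(edited, source, mask):
        blended.append([
            m * e + (1 - m) * s
            for e, s, m in zip(edited_row, source_row, mask_row)
        ])
    return blended


def invert_mask(mask):
    """Swap the roles of the two regions, e.g. to change the background instead."""
    return [[1 - m for m in row] for row in mask]
```

With the mask as given, values outside the masked region come from the source latent, which preserves the background. Passing `invert_mask(mask)` instead applies the edit only outside the original region.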

Figure 2: Overview of GenVideo. Inflated attention layers are finetuned during source video finetuning. During inference, InvEdit predicts a region to edit and latent correction uses that mask to improve the inter-frame temporal consistency. $\mathcal{M}_{\phi}$ denotes "no mask".

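Experimental results

Research questions

RQ1: Does target-image guidance enable accurate edits even when the shape and size of the target object differ from those of the source object?
RQ2: Does InvEdit provide shape aware mask localization that can be applied to video editing?
RQ3: Can the latent correction strategy improve inter-frame temporal consistency for edits that involve shape changes?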

Key findings

CLIP-T	DINO	Temp	Text	Image	Visual
0.238	0.236	0.957	3.6	3.3	4.2
0.234	0.189	0.980	4.3	4.3	3.7
0.231	0.216	0.985	3.3	3.8	2.1
0.235	0.262	0.951	3.9	3.6	3.4
0.234	0.195	0.949	4.0	4.1	5.0
0.241	0.374	0.967	1.7	1.8	2.3
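This review does not define how the Temp column is computed. One common way to score temporal consistency is the mean cosine similarity between embeddings of consecutive frames; the sketch below shows that calculation. Whether the paper uses exactly this definition is an assumption, and `temporal_consistency` is a hypothetical helper that takes embeddings already extracted by some image encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def temporal_consistency(frame_embeddings):
    """Mean cosine similarity over all pairs of consecutive frame embeddings."""
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine_similarity(prev, cur) for prev, cur in pairs]
    return sum(sims) / len(sims)
```

Under this definition, a score near 1 means consecutive frames have nearly identical embeddings.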

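In the user study, GenVideo outperforms state-of-the-art baselines on alignment with the target text and the target image.
InvEdit masks enable accurate, shape aware localization of the edit while keeping the background appropriately intact.
Latent correction improves inter-frame temporal consistency by blending latents using feature correspondences between frames.
GenVideo also demonstrates zero-shot image editing for shape-changing targets, such as car to bus, while maintaining consistency.
The quantitative metrics show GenVideo achieving higher CLIP-T and DINO scores and lower aggregate user-study scores for text and image alignment.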
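The idea of blending latents through inter-frame feature correspondences can be sketched roughly as follows. This is an illustration under several assumptions, not the paper's procedure: latents are flattened to one scalar per spatial position, correspondences are found by nearest-neighbour cosine similarity against a reference frame, and the blend weight `alpha` and the name `correct_latents` are hypothetical.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_latents(ref_latents, ref_feats, cur_latents, cur_feats, mask, alpha=0.5):
    """Pull masked latents of the current frame toward their matches in a reference frame.

    ref_latents, cur_latents: one float per spatial position.
    ref_feats, cur_feats: one feature vector per spatial position.
    mask: 0/1 per position of the current frame; only positions with 1 are corrected.
    """
    corrected = []
    for latent, feat, inside in zip(cur_latents, cur_feats, mask):
        if not inside:
            corrected.append(latent)  # outside the edit region: leave unchanged
            continue
        # Index of the reference position whose feature is most similar.
        match = max(range(len(ref_feats)), key=lambda k: _cosine(feat, ref_feats[k]))
        corrected.append((1 - alpha) * latent + alpha * ref_latents[match])
    return corrected
```

Restricting the correction to masked positions keeps the background latents untouched, so only the edited region is blended toward the reference frame.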

Figure 3: InvEdit approach. The mask is generated by first iteratively computing noise differences across multiple timesteps for the source denoising branch and target denoising branch. Then, these differences are averaged and binarized to obtain the InvEdit mask.
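The averaging and binarization described in this caption can be sketched as below. The caption does not say how the difference is measured or how the threshold is chosen, so the absolute difference, the normalization by the peak value, the default `threshold=0.5`, the single-channel layout, and the name `invedit_mask` are all assumptions made for illustration.

```python
def invedit_mask(source_noise, target_noise, threshold=0.5):
    """Build a binary mask from per-timestep noise predictions of two branches.

    source_noise, target_noise: lists over timesteps; each entry is a 2D list
    (H x W) of predicted noise values for one channel.
    Returns a 2D list (H x W) of 0/1.
    """
    steps = len(source_noise)
    height = len(source_noise[0])
    width = len(source_noise[0][0])

    # Average the absolute source/target difference over all timesteps.
    mean_diff = [[0.0] * width for _ in range(height)]
    for eps_src, eps_tgt in zip(source_noise, target_noise):
        for i in range(height):
            for j in range(width):
                mean_diff[i][j] += abs(eps_src[i][j] - eps_tgt[i][j]) / steps

    # Normalize by the peak so the threshold is relative, then binarize.
    peak = max(max(row) for row in mean_diff)
    if peak == 0:
        return [[0] * width for _ in range(height)]
    return [[1 if value / peak >= threshold else 0 for value in row] for row in mean_diff]
```

Positions where the two branches predict clearly different noise end up in the mask, while positions where they agree are left out.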


This review was generated by AI and checked by a human editor.