QUICK REVIEW

[Paper Review] DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

Chong Mou, Xintao Wang|arXiv (Cornell University)|Jul 5, 2023

Generative Adversarial Networks and Image Synthesis | Citations: 33

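One-line summary

DragonDiffusion enables drag-style image editing on a pre-trained diffusion model without fine-tuning, using gradient guidance derived from feature correspondence. It supports object moving, resizing, appearance replacement, object pasting, and content dragging. A memory bank built during DDIM inversion, together with visual cross-attention, keeps the edited result consistent with the original image.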

ABSTRACT

Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at https://github.com/MC-E/DragonDiffusion.

Motivation and objectives

Motivate drag-style, fine-grained image editing beyond point dragging in diffusion models.
Convert editing operations into gradient guidance via feature correspondence in a pre-trained diffusion UNet.
Develop multi-scale guidance that combines semantic and geometric alignment of features.
Ensure content consistency with the original image using a memory-bank-based visual cross-attention strategy.
Demonstrate editing capabilities across single-image and cross-image tasks without extra fine-tuning.

Proposed method

Represent editing as changes in feature correspondence within the pre-trained SD UNet denoiser.
Construct energy functions that convert editing targets into gradient guidance, using the cosine similarity between features of the generation branch (Gen) and guidance features (Gud) held in the memory bank.
Use DDIM inversion with a memory bank to store per-step latent features and attention keys/values for guidance.
Apply multi-scale guidance by combining second- and third-layer features for semantic and geometric alignment.
Implement visual cross-attention by substituting memory-bank keys/values into the UNet decoder’s attention, enabling cross-image consistency.
Optionally augment with an inpainting-like E_opt term to suppress artifacts in edited regions.
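
The energy-and-gradient idea in the list above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`edit_energy`, `multi_scale_energy`, `guided_step`) are hypothetical, the energy is simplified to one minus the mean rescaled cosine similarity, the "features" are taken to be the latent itself, and the gradient is estimated by finite differences, whereas the real method would backpropagate through the UNet.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def edit_energy(gen_feats, gud_feats, pairs):
    """Low when generation features match the paired guidance features.

    pairs holds (gen_index, gud_index) correspondences, e.g. a target
    location in the edited image paired with a source location in the
    original. Similarity is rescaled to [0, 1]; using 1 - mean similarity
    as the energy is a simplification assumed for this sketch.
    """
    sims = [0.5 * cosine(gen_feats[i], gud_feats[j]) + 0.5 for i, j in pairs]
    return 1.0 - sum(sims) / len(sims)

def multi_scale_energy(gen_layers, gud_layers, pairs_per_layer, weights):
    """Weighted sum of per-layer energies (e.g. two feature layers)."""
    return sum(
        w * edit_energy(g, r, p)
        for w, g, r, p in zip(weights, gen_layers, gud_layers, pairs_per_layer)
    )

def numerical_grad(energy_fn, latent, eps=1e-5):
    """Central-difference gradient of energy_fn with respect to latent."""
    grad = [[0.0] * len(row) for row in latent]
    for i, row in enumerate(latent):
        for k in range(len(row)):
            original = row[k]
            row[k] = original + eps
            e_plus = energy_fn(latent)
            row[k] = original - eps
            e_minus = energy_fn(latent)
            row[k] = original  # restore the perturbed entry
            grad[i][k] = (e_plus - e_minus) / (2 * eps)
    return grad

def guided_step(latent, gud_feats, pairs, step_size=0.5):
    """One guidance step: move the latent down the energy gradient."""
    grad = numerical_grad(lambda z: edit_energy(z, gud_feats, pairs), latent)
    return [
        [x - step_size * g for x, g in zip(row, grad_row)]
        for row, grad_row in zip(latent, grad)
    ]

# Toy usage: ask location 1 of the edited image to look like location 0
# of the original ("move the object from 0 to 1").
gud = [[1.0, 0.0], [0.0, 1.0]]
latent = [[1.0, 0.0], [0.0, 1.0]]
pairs = [(1, 0)]
before = edit_energy(latent, gud, pairs)
latent = guided_step(latent, gud, pairs)
after = edit_energy(latent, gud, pairs)
print(before, after)  # the energy drops after the guided step
```

Only the location named in `pairs` receives a non-zero gradient here, which loosely mirrors how a region mask restricts the edit; in the actual method the content-preservation terms and the sampler's own denoising update would act alongside this step.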

Experimental results

Research questions

RQ1: Can diffusion models achieve drag-style editing beyond point dragging without fine-tuning?
RQ2: How can feature correspondence across diffusion-model layers be exploited for precise content editing and cross-image consistency?
RQ3: What energy-function design and memory-bank strategy best enable semantic and geometric editing while maintaining the original content?
RQ4: How does cross-attention with memory-bank features affect editing fidelity and artifact suppression?

Key findings

Method	Preparing complexity	Inference complexity	Unaligned face	17 points	68 points	FID (17/68 points)
UserControllableLT	1.2 s	0.05 s	✗	32.32	24.15	51.20/50.32
DragGAN	52.40 s	6.71 s	✗	15.96	10.60	39.27/39.50
DragDiffusion	48.25 s	19.71 s	✓	22.95	17.32	38.06/36.55
DragonDiffusion (ours)	3.62 s	15.93 s	✓	18.51	13.94	35.75/34.58
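
To make the timing columns easier to compare, the short sketch below adds the preparing and inference times from the table for each method. Treating the sum as the end-to-end cost of a single edit on a new image is an assumption of this sketch, not a figure reported in the review; the `timings` dictionary and `total_seconds` helper are illustrative names.

```python
# Timings in seconds, copied from the table above: (preparing, inference).
timings = {
    "UserControllableLT": (1.2, 0.05),
    "DragGAN": (52.40, 6.71),
    "DragDiffusion": (48.25, 19.71),
    "DragonDiffusion": (3.62, 15.93),
}

def total_seconds(method):
    """Preparing time plus inference time for one method, in seconds."""
    prep, infer = timings[method]
    return round(prep + infer, 2)

for name in timings:
    print(f"{name}: {total_seconds(name):.2f} s")
```

Under this reading, DragonDiffusion totals 19.55 s against 67.96 s for DragDiffusion, and most of that gap comes from the preparing stage rather than from inference.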

The method achieves drag-style editing via gradient guidance from feature correspondence without extra training.
Multi-scale guidance using second and third layer features balances semantic and geometric editing quality.
Memory-bank and visual cross-attention improve consistency between edited regions and the original image.
DragonDiffusion supports object moving, resizing, appearance replacing, object pasting, and content dragging with competitive stability.
Compared to DragGAN, DragonDiffusion offers better content consistency and robustness in unaligned and multi-object scenarios.
On face manipulation tasks, DragonDiffusion demonstrates favorable trade-offs between editing accuracy, robustness, and consistency.
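
One way to see why the memory bank and visual cross-attention help consistency is the minimal sketch below. It assumes a hypothetical `MemoryBank` class that stores, per timestep, the latent plus attention keys and values recorded while inverting the original image, and a hypothetical `visual_cross_attention` helper that lets the editing branch's queries attend over those stored keys and values. The real method applies this inside the UNet decoder's attention layers; here plain Python lists stand in for tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

class MemoryBank:
    """Per-timestep store filled while inverting the original image."""

    def __init__(self):
        self._store = {}

    def save(self, t, latent, keys, values):
        self._store[t] = {"latent": latent, "keys": keys, "values": values}

    def load(self, t):
        return self._store[t]

def visual_cross_attention(gen_queries, bank, t):
    """Queries come from the editing branch; keys and values come from
    the original image's entry in the memory bank at timestep t."""
    entry = bank.load(t)
    return attention(gen_queries, entry["keys"], entry["values"])

# Toy usage with two stored tokens from the "original image".
bank = MemoryBank()
bank.save(
    t=10,
    latent=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
out = visual_cross_attention([[10.0, 0.0]], bank, t=10)
print(out)  # close to the first stored value vector
```

Because the attention weights are non-negative and sum to one, every output is a weighted average of value vectors taken from the original image, so the edited branch can only recombine content that the original already contains.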

This review was generated by AI and checked by a human editor.