QUICK REVIEW

[Paper Review] DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

Chong Mou, Xintao Wang | arXiv (Cornell University) | Jul 5, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 33

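One-Sentence Summary

DragonDiffusion enables drag-style image editing on pre-trained diffusion models through gradient guidance derived from feature correspondence. It requires no fine-tuning and supports object moving, resizing, appearance replacement, object pasting, and content dragging. It relies on a memory bank built during DDIM inversion and on multi-scale feature guidance, combined with visual cross-attention to keep the result consistent with the original image.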

ABSTRACT

Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at https://github.com/MC-E/DragonDiffusion.
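
The guidance idea in the abstract can be written compactly. The following is a sketch in standard classifier-guidance notation, not a formula quoted from the paper. The symbols are assumptions of this note: z_t is the latent at step t, y the editing target, \epsilon_\theta the pre-trained denoiser, E an energy function built from feature correspondence, and \eta a guidance strength that absorbs the noise scale.

```latex
% Conditional score split into an unconditional part and a guidance part
\begin{align}
\nabla_{z_t} \log q(z_t \mid y)
  &= \nabla_{z_t} \log q(z_t) + \nabla_{z_t} \log q(y \mid z_t), \\
% Assumption: the guidance term is modeled by an energy function
q(y \mid z_t) &\propto \exp\bigl(-E(z_t; y)\bigr), \\
% Resulting guided noise prediction (eta absorbs the noise scale)
\hat{\epsilon}_t &= \epsilon_\theta(z_t, t) + \eta \, \nabla_{z_t} E(z_t; y).
\end{align}
```

Under these assumptions, lowering E (better feature correspondence with the editing target) pushes the sampling trajectory toward the desired edit without changing the model weights.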

Motivation and Objectives

Motivate drag-style, fine-grained image editing beyond point dragging in diffusion models.
Convert editing operations into gradient guidance via feature correspondence in a pre-trained diffusion UNet.
Develop multi-scale guidance that combines semantic and geometric alignment of features.
Ensure content consistency with the original image using a memory-bank based visual cross-attention strategy.
Demonstrate editing capabilities across single-image and cross-image tasks without extra fine-tuning.

Proposed Method

Represent editing as changes in feature correspondence within the pre-trained SD UNet denoiser.
Construct energy functions that convert editing targets into gradient guidance, using the cosine similarity between features of the generation (Gen) branch and the guidance (Gud) branch stored in the memory bank.
Use DDIM inversion with a memory bank to store per-step latent features and attention keys/values for guidance.
Apply multi-scale guidance by combining second- and third-layer features for semantic and geometric alignment.
Implement visual cross-attention by substituting memory-bank keys/values into the UNet decoder’s attention, enabling cross-image consistency.
Optionally augment with an inpainting-like E_opt term to suppress artifacts in edited regions.
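
The energy-function step above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the flat list-of-vectors feature layout, the constants `alpha` and `beta`, and the specific form 1 / (alpha + beta * similarity) are all assumptions chosen for illustration. In a real pipeline the features would be UNet activations and the gradient with respect to the latent would come from automatic differentiation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def correspondence_energy(gen_feats, gud_feats, alpha=1.0, beta=4.0):
    """Energy that is low when generation features match guidance features.

    gen_feats / gud_feats: lists of feature vectors sampled at the target
    and source locations (hypothetical layout, one vector per location).
    alpha, beta: illustrative constants, not values from the paper.
    """
    # Map cosine similarity from [-1, 1] to [0, 1], then average over locations.
    sims = [0.5 * cosine(g, r) + 0.5 for g, r in zip(gen_feats, gud_feats)]
    mean_sim = sum(sims) / len(sims)
    return 1.0 / (alpha + beta * mean_sim)


def multi_scale_energy(gen_by_layer, gud_by_layer, weights=(1.0, 1.0)):
    """Weighted sum of per-layer energies, e.g. over two feature layers."""
    return sum(
        w * correspondence_energy(g, r)
        for w, g, r in zip(weights, gen_by_layer, gud_by_layer)
    )
```

With these illustrative constants, perfectly matching features give the minimum energy and opposite features the maximum, so descending the gradient of the energy moves the generated features toward the guidance features.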

Experimental Results

Research Questions

RQ1: Can diffusion models achieve drag-style editing beyond point dragging without fine-tuning?
RQ2: How can feature correspondence across diffusion-model layers be exploited for precise content editing and cross-image consistency?
RQ3: What energy-function design and memory-bank strategy best enable semantic and geometric editing while preserving the original content?
RQ4: How does cross-attention with memory-bank features affect editing fidelity and artifact suppression?

Key Findings

Method	Preparing complexity	Inference complexity	Unaligned face	17 points	68 points	FID (17/68 points)
UserControllableLT	1.2 s	0.05 s	✗	32.32	24.15	51.20 / 50.32
DragGAN	52.40 s	6.71 s	✗	15.96	10.60	39.27 / 39.50
DragDiffusion	48.25 s	19.71 s	✓	22.95	17.32	38.06 / 36.55
DragonDiffusion (ours)	3.62 s	15.93 s	✓	18.51	13.94	35.75 / 34.58

The method achieves drag-style editing via gradient guidance from feature correspondence without extra training.
Multi-scale guidance using second and third layer features balances semantic and geometric editing quality.
Memory-bank and visual cross-attention improve consistency between edited regions and the original image.
DragonDiffusion supports object moving, resizing, appearance replacing, object pasting, and content dragging with competitive stability.
Compared to DragGAN, DragonDiffusion offers better content consistency and robustness in unaligned/multi-object scenarios.
On face manipulation tasks, DragonDiffusion demonstrates favorable trade-offs between editing accuracy, robustness, and consistency.
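
The memory-bank and visual cross-attention mechanism credited above for content consistency can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the `memory_bank` dictionary keyed by step, the function names, and the single-head list-based attention are hypothetical. The only idea taken from the review is that queries come from the generation branch while keys and values are substituted from what was stored during DDIM inversion.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output row is a convex combination of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def visual_cross_attention(gen_queries, memory_bank, step):
    """Attend from generation-branch queries to stored keys/values.

    memory_bank: hypothetical dict mapping a diffusion step to the
    (keys, values) pair saved at that step during inversion.
    """
    keys, values = memory_bank[step]
    return attention(gen_queries, keys, values)
```

Because every output row is a weighted mix of the stored values, the edited result can only draw appearance from content recorded from the original (or reference) image, which is one way to read the consistency benefit reported above.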


This review was generated by AI and reviewed by human editors.