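[Paper Explained] Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code
Direct Inversion decouples the source and target diffusion branches for editing. With only three lines of code it achieves the best content preservation and edit fidelity among the compared methods, is validated on PIE-Bench, and runs significantly faster than optimization-based inversion.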
Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of magnitude speed-up.
Motivation and Goals
- Motivate inversion strategies for diffusion-based image editing and examine whether optimization-based inversion is actually necessary.
- Propose a simple, plug-and-play inversion method that preserves essential content while enabling faithful edits.
- Show that disentangling source and target branches yields superior performance without heavy optimization.
- Provide a standardized benchmark (PIE-Bench) and robust evaluation to compare inversion techniques.
Proposed Method
- Disentangle source and target diffusion branches to assign distinct roles: preservation for the source and fidelity for the target.
- Add three lines of code to the forward-editing process: at each step, compute the difference between the inverted source latent and the forward-generated source latent, then add that difference back to the source branch of the editing chain. No optimization is involved.
- Keep the target branch untouched to maximize edit fidelity.
- Perform a two-part procedure: (a) invert the source image via DDIM Inversion; (b) execute editing with Direct Inversion by propagating the source latent difference through the forward DDIM steps.
- Introduce PIE-Bench, a 700-image editing benchmark with 10 editing types and annotations (prompts, mask) for standardized evaluation.
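The two-part procedure above can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_eps` is a hypothetical stand-in for the noise-prediction network, a single float stands in for the latent, the conditioning values `c_src` and `c_tgt` are made-up scalars, and the two branches do not share attention features as real editing methods do. It only shows that adding the per-step difference back to the source branch keeps that branch on the stored inversion trajectory, while the target branch is left unchanged.

```python
import math

def toy_eps(z, t, cond):
    """Hypothetical stand-in for the noise-prediction network (not a real model)."""
    return math.tanh(cond * z + 0.1 * t)

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    x0_pred = (z - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps

def ddim_invert(z0, alphas, cond):
    """Part (a): walk from the clean latent toward noise, storing every latent."""
    latents = [z0]
    for t in range(len(alphas) - 1):
        eps = toy_eps(latents[-1], t, cond)
        latents.append(ddim_step(latents[-1], eps, alphas[t], alphas[t + 1]))
    return latents  # latents[t] plays the role of the inverted latent at step t

def edit(latents, alphas, c_src, c_tgt, direct=True):
    """Part (b): denoise source and target branches from the same noisy latent."""
    z_src = z_tgt = latents[-1]
    for t in range(len(alphas) - 1, 0, -1):
        src_next = ddim_step(z_src, toy_eps(z_src, t, c_src), alphas[t], alphas[t - 1])
        tgt_next = ddim_step(z_tgt, toy_eps(z_tgt, t, c_tgt), alphas[t], alphas[t - 1])
        if direct:
            offset = latents[t - 1] - src_next  # gap to the stored inversion trajectory
            src_next = src_next + offset        # correct the source branch only
            # the target branch is deliberately left untouched
        z_src, z_tgt = src_next, tgt_next
    return z_src, z_tgt

# Index 0 is the clean end of the schedule, the last index is the noisiest (assumed values).
alphas = [0.999, 0.9, 0.7, 0.5, 0.3, 0.1]
z0, c_src, c_tgt = 0.8, 1.0, 1.5

latents = ddim_invert(z0, alphas, c_src)
src_plain, tgt_plain = edit(latents, alphas, c_src, c_tgt, direct=False)
src_direct, tgt_direct = edit(latents, alphas, c_src, c_tgt, direct=True)

print("plain DDIM source error:", abs(src_plain - z0))
print("corrected source error: ", abs(src_direct - z0))
```

In this toy setting the uncorrected source branch drifts away from the original latent because the noise prediction is evaluated at different points during inversion and denoising, whereas the corrected branch returns to it up to floating-point rounding. In a real editing pipeline the corrected source branch would additionally feed features (for example attention maps) to the target branch, which this sketch omits.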
Experimental Results
Research Questions
- RQ1: Can optimization-based inversion be replaced by a simple, disentangled-branch approach without sacrificing edit fidelity or content preservation?
- RQ2: Does keeping the target branch untouched while correcting only the source latent improve stability and performance across editing methods?
- RQ3: How much speed and accuracy can be gained by a plug-and-play three-line code solution in diffusion-based editing?
- RQ4: What is the impact of a standardized benchmark (PIE-Bench) on fair evaluation of inversion methods?
Key Findings
| Inversion Method | Editing Method | Structure Distance (×10^3) ↓ | PSNR ↑ | LPIPS (×10^3) ↓ | MSE (×10^4) ↓ | SSIM (×10^2) ↑ | Whole CLIPSIM ↑ | Edited CLIPSIM ↑ | Notes |
|---|---|---|---|---|---|---|---|---|---|
| DDIM | P2P | 69.43 | 17.87 | 208.80 | 219.88 | 71.14 | 25.01 | 22.44 | -- |
| NT | P2P | 13.44 | 27.03 | 60.67 | 35.86 | 84.11 | 24.75 | 21.86 | -- |
| NP | P2P | 16.17 | 26.21 | 69.01 | 39.73 | 83.40 | 24.61 | 21.87 | -- |
| StyleD | P2P | 11.65 | 26.05 | 66.10 | 38.63 | 83.42 | 24.78 | 21.72 | -- |
| Ours | P2P | 11.65 | 27.22 | 54.55 | -- | 84.76 | 25.02 | 22.10 | (Direct Inversion); structure distance 83% lower than DDIM |
| DDIM | MasaCtrl | 28.38 | 22.17 | 106.62 | 86.97 | 79.67 | 23.96 | 21.16 | -- |
| Ours | MasaCtrl | 24.70 | 22.64 | 87.94 | 81.09 | 81.33 | 24.38 | 21.35 | (Direct Inversion) |
| DDIM | P2P-Zero | 61.68 | 20.44 | 172.22 | 144.12 | 74.67 | 22.80 | 20.54 | -- |
| Ours | P2P-Zero | 49.22 | 21.53 | 138.98 | 127.32 | 77.05 | 23.31 | 21.05 | (Direct Inversion) |
| DDIM | PnP * | 28.22 | 22.28 | 113.46 | 83.64 | 79.05 | 25.41 | 22.55 | -- |
| Ours | PnP * | 24.29 | 22.46 | 106.06 | 80.45 | 79.68 | 25.41 | 22.62 | (Direct Inversion) |
- Compared against five inversion techniques across eight editing methods, Direct Inversion improves both content preservation and edit fidelity.
- It yields up to 83.2% improvement in structure distance and up to 73.9% improvement in background LPIPS, with up to 8.8% gains in Edit Region CLIPSIM.
- The method achieves nearly an order of magnitude speedup over optimization-based inversions (e.g., NT and StyleDiffusion).
- Across eight editing approaches, Direct Inversion boosts content preservation by up to 20.2% and edit fidelity by up to 2.5%.
- PIE-Bench provides 700 images across 10 editing types with annotations to enable robust, standardized comparisons.
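The percentage improvements quoted above are relative reductions against a baseline. As a small worked check, the snippet below recomputes the structure-distance reduction of Direct Inversion relative to DDIM inversion from the table rows. The helper name `relative_reduction` and the dictionary layout are illustrative choices, and only values that appear in the table are used. The P2P pair works out to roughly 83.2%, in line with the headline figure.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline` (lower is better)."""
    return 100.0 * (baseline - ours) / baseline

# Structure distance (x10^3) taken from the table: (DDIM inversion, Direct Inversion)
structure_distance = {
    "P2P": (69.43, 11.65),
    "MasaCtrl": (28.38, 24.70),
    "P2P-Zero": (61.68, 49.22),
    "PnP": (28.22, 24.29),
}

reductions = {
    method: relative_reduction(ddim, ours)
    for method, (ddim, ours) in structure_distance.items()
}

for method, pct in reductions.items():
    print(f"{method}: {pct:.1f}% lower structure distance than DDIM inversion")
```

The same helper applies to any lower-is-better column (LPIPS, MSE). For higher-is-better metrics such as PSNR or CLIPSIM, the relative gain would be computed as `(ours - baseline) / baseline` instead.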
This explanation was generated by AI and reviewed by a human editor.