QUICK REVIEW

[Paper Review] DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation

Gwanghyun Kim, Taesung Kwon | arXiv (Cornell University) | Oct 6, 2021

Image Processing Techniques and Applications | Cited by 38

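One-Sentence Summary

DiffusionCLIP fine-tunes a diffusion model under CLIP guidance to perform robust, zero-shot, text-driven image manipulation, including manipulation into unseen domains and multi-attribute changes.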

ABSTRACT

Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their applications to diverse real images are still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulties in reconstructing images with novel poses, views, and highly variable contents compared to the training data, altering object identity, or producing unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Based on full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains and takes another step towards general application by manipulating images from a widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirmed robust and superior manipulation performance of our methods compared to the existing baselines. Code is available at https://github.com/gwang-kim/DiffusionCLIP.git.

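Research Motivation and Goals

Overcome the limitations of GAN inversion on diverse real images and advance robust zero-shot image manipulation.
Exploit the inversion and generation capabilities of diffusion models to edit content faithfully while preserving identity.
Enable manipulation of images into unseen domains as well as translation between unseen domains.
Introduce a noise combination method that enables multi-attribute manipulation within a single sampling process.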

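Proposed Method

Map the input image to a latent noise with a pretrained diffusion model, using a deterministic forward diffusion process (DDIM, viewed as an ODE).
Fine-tune the reverse diffusion model with a CLIP-guided loss so that attributes shift toward the target text while identity is preserved.
Use a directional CLIP loss that aligns the image change with the text change in CLIP space, supplemented by an identity loss to prevent unwanted changes.
Rely on deterministic forward and reverse DDIM sampling for near-perfect inversion and controlled generation.
Introduce a fast sampling strategy that uses a return step and fewer forward/generation steps to trade off quality against speed.
Achieve multi-attribute transfer by linearly combining, during sampling, the noise predicted by multiple fine-tuned models.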
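
The method points above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the noise predictors (base_eps, eps_a, eps_b) are toy stand-ins for a pretrained model and two fine-tuned models, the embeddings passed to directional_clip_loss would in practice come from CLIP's encoders, and every function name and the noise schedule are hypothetical.

```python
import numpy as np


def directional_clip_loss(img_src, img_gen, txt_src, txt_tgt):
    """1 - cosine similarity between the image shift and the text shift.

    All four arguments are embedding vectors; in practice they would come
    from CLIP's image and text encoders (not called in this sketch).
    """
    d_img = np.asarray(img_gen, dtype=float) - np.asarray(img_src, dtype=float)
    d_txt = np.asarray(txt_tgt, dtype=float) - np.asarray(txt_src, dtype=float)
    cos = d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt))
    return 1.0 - cos


def ddim_step(x, eps, a_from, a_to):
    """One deterministic DDIM update (eta = 0) between two noise levels.

    a_from and a_to are cumulative alpha products (alpha-bar) in (0, 1).
    """
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_trajectory(x, eps_fn, alpha_bar, steps):
    """Apply ddim_step along consecutive pairs of timesteps in `steps`.

    Increasing steps perform inversion (image -> latent); decreasing steps
    perform generation (latent -> image).
    """
    for t_from, t_to in zip(steps[:-1], steps[1:]):
        x = ddim_step(x, eps_fn(x, t_from), alpha_bar[t_from], alpha_bar[t_to])
    return x


def combine_noise(eps_fns, weights):
    """Return a predictor that outputs a weighted sum of several predictors."""
    def eps_fn(x, t):
        return sum(w * f(x, t) for w, f in zip(weights, eps_fns))
    return eps_fn


# Toy noise predictors (assumptions, standing in for real networks).
def base_eps(x, t):
    return 0.1 * np.tanh(x)

def eps_a(x, t):
    return base_eps(x, t) + 0.05

def eps_b(x, t):
    return base_eps(x, t) - 0.05


T = 50  # illustrative number of timesteps
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T + 1))
steps = list(range(T + 1))

x0 = np.random.default_rng(0).normal(size=(4, 4))
latent = ddim_trajectory(x0, base_eps, alpha_bar, steps)           # inversion
recon = ddim_trajectory(latent, base_eps, alpha_bar, steps[::-1])  # generation
mixed_eps = combine_noise([eps_a, eps_b], [0.5, 0.5])
edited = ddim_trajectory(latent, mixed_eps, alpha_bar, steps[::-1])
print("round-trip max error:", np.abs(recon - x0).max())
```

A single ddim_step is exactly reversible when the same noise estimate is used in both directions. The full round trip is only approximately exact because the predictor is evaluated at slightly different points on the way forward and back, which is why the inversion is described as near-perfect.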

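Experimental Results

Research Questions

RQ1: Can diffusion-based inversion, guided by text prompts, faithfully manipulate real images both in-domain and out-of-domain?
RQ2: Can the method translate between unseen domains and generate images in unseen domains from strokes or other inputs?
RQ3: Can combining the noise from multiple fine-tuned models achieve multi-attribute manipulation within a single sampling process?
RQ4: Which sampling hyperparameters best balance reconstruction quality, speed, and attribute control?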

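Key Findings

DiffusionCLIP achieves near-perfect reconstruction quality, outperforming GAN inversion baselines on MAE, LPIPS, and SSIM.
It can manipulate real images into unseen domains and translate between unseen domains, outperforming baselines in qualitative comparisons and human evaluation.
The directional CLIP loss with an identity constraint yields robust attribute control together with high segmentation consistency and identity preservation.
Multi-attribute transfer is achieved by combining the noise from multiple fine-tuned models within a single sampling process.
Fast sampling with a return step and fewer steps delivers a practical speedup with limited loss of fidelity.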
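
To illustrate where the speedup from fast sampling comes from, the sketch below builds a coarse timestep schedule that stops at a return step instead of running the full diffusion chain. The function names and the numbers (a return step of 500, with 40 forward and 6 generation timesteps) are illustrative assumptions chosen for this example, not settings taken from this summary.

```python
import numpy as np


def return_step_schedule(t0, num_steps):
    """Evenly spaced integer timesteps from 0 up to the return step t0."""
    if num_steps < 2:
        raise ValueError("need at least two timesteps to take one step")
    return np.linspace(0, t0, num_steps).round().astype(int)


def model_calls(schedule):
    """Each transition between neighbouring timesteps costs one model call."""
    return len(schedule) - 1


t0 = 500                                        # hypothetical return step
forward = return_step_schedule(t0, 40)          # image -> latent at t0
generate = return_step_schedule(t0, 6)[::-1]    # latent at t0 -> edited image

fast = model_calls(forward) + model_calls(generate)
dense = 2 * t0  # visiting every timestep up to t0, in both directions
print(f"fast: {fast} model calls, dense: {dense} model calls")
```

Lowering the return step or the number of timesteps reduces cost further, at the price of coarser inversion and weaker edits, which is the quality/speed trade-off raised in RQ4.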

This review was generated by AI and reviewed by a human editor.