QUICK REVIEW

[Paper Review] DiffEdit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek|arXiv (Cornell University)|Oct 20, 2022

Generative Adversarial Networks and Image Synthesis|References: 54|Citations: 102

ONE-LINE SUMMARY

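TL;DR: DiffEdit uses DDIM encoding and the difference between a diffusion model's noise estimates under different texts to infer the edit-region mask for text-guided semantic image editing automatically. This enables localized edits without a manually prepared mask and gives strong results on ImageNet, COCO, and Imagen-generated images.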

ABSTRACT

Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.

MOTIVATION AND OBJECTIVES

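Motivate semantic image editing: apply a text-specified transformation while preserving as much of the input image as possible.
Infer the edit region automatically from the diffusion model's predictions under different texts, removing the need for a user-provided mask.
Use DDIM encoding to better preserve the input content inside the edited region.
Combine mask guidance with conditional diffusion to produce high-quality, natural-looking edits.
Analyze, both theoretically and empirically, the advantages over prior diffusion-based editing methods.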

PROPOSED METHOD

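Use a text-conditioned diffusion model and estimate the edit mask M by comparing its noise estimates under the edit text Q and under a reference (or empty) text.
DDIM-encode the input image with the unconditional model (no text) to obtain the latent y_r.
Decode conditioned on the edit text Q and, using the estimated mask, replace background pixels with the encoded latent x_t, which produces a localized edit.
Integrate the mask-guided DDIM update y_t' = M y_t + (1 - M) x_t; the encoding ratio r sets the number of denoising steps and thereby controls the edit strength.
Provide a theoretical comparison (Proposition 1) between DiffEdit's DDIM-encoded editing and SDEdit's noise addition under realistic Lipschitz and boundedness assumptions, and explain why the bound is tighter when the unconditional and conditional noise estimates are similar.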
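
The mask estimation and the mask-guided update listed above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names (estimate_mask, masked_update, num_decode_steps) are hypothetical, the arrays stand in for the outputs of a real diffusion model, and the averaging over noisy copies, the min-max rescaling, and the 0.5 threshold are assumptions of this sketch rather than details stated in this review.

```python
import numpy as np


def estimate_mask(eps_query, eps_ref, threshold=0.5):
    """Estimate a binary edit mask from two sets of noise estimates.

    eps_query, eps_ref: arrays of shape (n, C, H, W) holding noise estimates
    for n noisy copies of the input, conditioned on the edit text Q and on
    the reference (or empty) text. Averaging, rescaling and thresholding
    are assumptions of this sketch.
    """
    # Average the absolute difference over noisy copies and channels -> (H, W).
    diff = np.abs(eps_query - eps_ref).mean(axis=(0, 1))
    span = diff.max() - diff.min()
    if span == 0:
        # The two texts produce identical estimates: nothing to edit.
        return np.zeros_like(diff)
    normalized = (diff - diff.min()) / span  # rescale to [0, 1]
    return (normalized >= threshold).astype(diff.dtype)


def masked_update(mask, y_t, x_t):
    """Mask-guided update: y_t' = M y_t + (1 - M) x_t.

    Inside the mask the text-conditioned latent y_t is kept; outside it the
    DDIM-encoded latent x_t of the input image is restored.
    """
    return mask * y_t + (1.0 - mask) * x_t


def num_decode_steps(encode_ratio, total_steps):
    """Number of denoising steps implied by an encoding ratio r in [0, 1]."""
    return int(round(encode_ratio * total_steps))


# Toy usage: the two noise estimates differ only inside a 3x3 patch.
rng = np.random.default_rng(0)
eps_ref = rng.normal(size=(4, 3, 8, 8))
eps_query = eps_ref.copy()
eps_query[:, :, 2:5, 2:5] += 1.0

mask = estimate_mask(eps_query, eps_ref)
y_t = np.ones((3, 8, 8))   # stand-in for the text-conditioned latent
x_t = np.zeros((3, 8, 8))  # stand-in for the DDIM-encoded input latent
blended = masked_update(mask, y_t, x_t)
steps = num_decode_steps(0.5, 50)
```

In the toy usage the mask covers only the patch where the two estimates disagree, so the blended latent takes y_t there and x_t everywhere else. A larger encoding ratio yields more denoising steps, which in the method corresponds to a stronger edit at the cost of fidelity to the input.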

EXPERIMENTAL RESULTS

Research questions

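RQ1: Can contrasting predictions under different text prompts steer a diffusion model toward localized edits without a user-provided mask?
RQ2: Does DDIM-encoding the input image preserve its appearance and integrate edits more smoothly than simply adding noise?
RQ3: What trade-off between edit strength and fidelity to the original image arises when DDIM encoding is combined with masking?
RQ4: How does DiffEdit compare with prior diffusion-based editing methods on ImageNet, COCO, and Imagen-generated images?
RQ5: Does a reference text improve mask quality and editing results in practice?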

Key findings

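DiffEdit achieves state-of-the-art editing performance on ImageNet compared with prior diffusion-based methods.
The estimated mask and DDIM encoding improve the CSFID–LPIPS trade-off over SDEdit and other baselines on ImageNet, COCO, and Imagen-generated images.
Ablations show that masking and DDIM encoding each improve results independently, and that combining them gives the best trade-off.
Computing the mask with a reference text (a caption of the original image) often yields better edits, particularly on the Imagen data, because it concentrates changes on the regions where the query and the reference differ.
The theoretical analysis (Proposition 1) shows that, under realistic Lipschitz and boundedness assumptions, DDIM-encoded editing gives a tighter bound on the distance between the edit and the input image than plain noise-based SDEdit.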

This review was generated by AI and checked by a human editor.