QUICK REVIEW

[Paper Review] InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Qian Wang, Biao Zhang|arXiv (Cornell University)|May 29, 2023

Multimodal Machine Learning Applications | Cited by 9

One-Sentence Summary

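InstructEdit combines language processing with Grounded SAM masks to enable diffusion-based image editing driven by user instructions, improving mask quality and edit results while supporting fine-grained edits.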

ABSTRACT

Recent works have explored text-guided image editing using diffusion models and generated edited images based on text prompts. However, the models struggle to accurately locate the regions to be edited and faithfully perform precise edits. In this work, we propose a framework termed InstructEdit that can do fine-grained editing based on user instructions. Our proposed framework has three components: language processor, segmenter, and image editor. The first component, the language processor, processes the user instruction using a large language model. The goal of this processing is to parse the user instruction and output prompts for the segmenter and captions for the image editor. We adopt ChatGPT and optionally BLIP2 for this step. The second component, the segmenter, uses the segmentation prompt provided by the language processor. We employ a state-of-the-art segmentation framework Grounded Segment Anything to automatically generate a high-quality mask based on the segmentation prompt. The third component, the image editor, uses the captions from the language processor and the masks from the segmenter to compute the edited image. We adopt Stable Diffusion and the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thus improve the quality of edited images. We also show that our framework can accept multiple forms of user instructions as input. We provide the code at https://github.com/QianWangX/InstructEdit.
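
The three-component structure described in the abstract can be sketched as a simple function composition. This is only an illustration of the data flow, not the authors' code: every name here (`ParsedInstruction`, `instruct_edit`, and the three callables) is hypothetical, and the callables stand in for ChatGPT/BLIP2, Grounded Segment Anything, and Stable Diffusion with DiffEdit-style mask-guided generation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ParsedInstruction:
    """Hypothetical container for what the language processor emits."""
    segmentation_prompt: str  # what the segmenter should locate, e.g. "dog"
    input_caption: str        # caption describing the original image
    edited_caption: str       # caption describing the desired result


def instruct_edit(
    image: Any,
    instruction: str,
    language_processor: Callable[[str], ParsedInstruction],
    segmenter: Callable[[Any, str], Any],
    image_editor: Callable[[Any, Any, str, str], Any],
) -> Any:
    """Run the three stages in order: parse, segment, edit."""
    # 1. Language processor: instruction -> segmentation prompt + captions.
    parsed = language_processor(instruction)
    # 2. Segmenter: image + segmentation prompt -> mask.
    mask = segmenter(image, parsed.segmentation_prompt)
    # 3. Image editor: image + mask + both captions -> edited image.
    return image_editor(image, mask, parsed.input_caption, parsed.edited_caption)
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in practice each one would wrap a pretrained model.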

Motivation and Objectives

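Enable fine-grained image editing from user instructions without requiring the user to create a mask.
Improve target localization and editing accuracy in multi-object images.
Automate the pipeline by leveraging pretrained language, segmentation, and diffusion models.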

Proposed Method

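Parse the user instruction with a large language model to produce a segmentation prompt plus input and edited captions.
Generate a high-quality mask from the segmentation prompt using Grounded Segment Anything (Grounded SAM).
Edit the image with a mask-guided, DDIM-based diffusion editor (Stable Diffusion) using the input caption and the edited caption.
Encode the input image into a noise tensor via DDIM inversion, with the encoding ratio r controlling the edit strength.
Optionally incorporate BLIP2 to describe the image and improve the prompts when the instruction is unclear.
Evaluate editing quality with LPIPS, CLIP score, CLIP directional similarity, and a user study.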
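
To make the roles of the encoding ratio r and the mask concrete, here is a minimal sketch of mask-guided decoding. It assumes the DiffEdit-style rule in which, after each denoising step, the region outside the mask is reset to the DDIM-inverted latent of the input image at the matching timestep; the review does not spell out the update rule, so treat this as an assumption. Latents are flattened lists of floats for illustration only, and `denoise_step` is a hypothetical stand-in for one caption-conditioned DDIM step of the diffusion model.

```python
from typing import Callable, List, Sequence

Latent = List[float]  # flattened latent, purely for illustration


def num_encoding_steps(total_steps: int, r: float) -> int:
    """Number of DDIM inversion steps implied by an encoding ratio r in [0, 1]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("encoding ratio r must lie in [0, 1]")
    return round(r * total_steps)


def blend_with_mask(edited: Latent, original: Latent, mask: Sequence[float]) -> Latent:
    """Keep the edited latent where mask == 1 and the original where mask == 0."""
    return [m * e + (1.0 - m) * o for e, o, m in zip(edited, original, mask)]


def masked_decode(
    inverted: Sequence[Latent],
    mask: Sequence[float],
    denoise_step: Callable[[Latent, int], Latent],
) -> Latent:
    """Decode from the noisiest inverted latent back to step 0.

    inverted[t] is the input image's latent after t inversion steps,
    so inverted[0] is the clean latent of the original image.
    """
    steps = len(inverted) - 1
    latent = list(inverted[steps])
    for t in range(steps, 0, -1):
        latent = denoise_step(latent, t)  # one step from t towards t - 1
        # Outside the mask, fall back to the original image's trajectory.
        latent = blend_with_mask(latent, inverted[t - 1], mask)
    return latent
```

Under these assumptions, a larger r means more inversion steps, so decoding starts from a noisier latent and the masked region can change more, while the unmasked region is pulled back to the original image at every step.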

Experimental Results

Research Questions

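RQ1: Can user instructions be parsed effectively enough to drive the segmentation and editing prompts without a user-provided mask?
RQ2: Does grounding-based masking with Grounded SAM improve fine-grained editing of multi-object images compared with mask-free baselines?
RQ3: How does instruction-driven editing perform in terms of semantic preservation and instruction adherence in single-object and multi-object scenarios?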

Key Findings

Method	LPIPS ↓	CLIP score ↑	CLIP directional similarity ↑
MDP-ε_t	0.214	26.414	0.079
InstructPix2Pix	0.290	25.844	0.114
DiffEdit	0.167	26.847	0.106
InstructEdit	0.121	27.404	0.082
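
For readers unfamiliar with the third metric column, CLIP directional similarity is commonly defined as the cosine similarity between the change in CLIP image embeddings (input image to edited image) and the change in CLIP text embeddings (input caption to edited caption). The sketch below assumes that definition and assumes the embeddings have already been computed; the function names are illustrative and are not taken from the paper's code.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def clip_directional_similarity(
    img_in: Sequence[float],
    img_out: Sequence[float],
    txt_in: Sequence[float],
    txt_out: Sequence[float],
) -> float:
    """Cosine between the image-space edit direction and the text-space one."""
    image_direction = [b - a for a, b in zip(img_in, img_out)]
    text_direction = [b - a for a, b in zip(txt_in, txt_out)]
    return cosine(image_direction, text_direction)
```

Because the metric compares directions rather than absolute embeddings, an edit that barely changes the image can score low on it even when the result matches the target caption well.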

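InstructEdit achieves the best LPIPS (0.121) and CLIP score (27.404) among the compared methods, indicating stronger preservation of the input image and closer alignment with the target caption; its CLIP directional similarity (0.082) is lower than that of InstructPix2Pix (0.114) and DiffEdit (0.106).
InstructEdit improves mask quality over DiffEdit, raising editing fidelity in complex scenes.
Grounded SAM reduces over-editing and mislocalization when a specific object or region is targeted.
BLIP2-assisted prompts improve editing quality when the user's description is ambiguous or incomplete.
In the user study, InstructEdit was rated higher than the baseline methods across 10 editing tests.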


This review was generated by AI and checked by a human editor.