QUICK REVIEW

[Paper Review] Imagic: Text-Based Real Image Editing with Diffusion Models

Bahjat Kawar, Shiran Zada|arXiv (Cornell University)|Oct 17, 2022

Generative Adversarial Networks and Image Synthesis | Citations: 34

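One-Sentence Summary

Imagic uses a pre-trained diffusion model to apply complex, non-rigid, text-based edits to a single real image. It optimizes a text embedding, fine-tunes the model, and interpolates between embeddings to balance fidelity to the input image against alignment with the target text.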

ABSTRACT

Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.

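Motivation and Objectives

Enable text-based semantic editing of a single high-resolution real image without auxiliary inputs.
Achieve complex non-rigid edits (pose, composition) that match the target text while preserving fidelity to the input image.
Demonstrate semantically meaningful interpolation between the embedding that represents the input image and the embedding of the target edit.
Introduce TEdBench, a challenging benchmark for evaluating text-based image editing methods.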

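Proposed Method

Use a pre-trained text-to-image diffusion model to edit a real image conditioned on a text prompt.
Optimize the target text embedding under the denoising objective so that it reconstructs the input image.
Fine-tune the diffusion model (and its auxiliary upscalers) so that it fits the input image more closely when conditioned on the optimized embedding.
Obtain the edit embedding by linearly interpolating between the optimized embedding, which represents the input image, and the target text embedding.
Run the diffusion process conditioned on the interpolated embedding to generate the edited image, applying super-resolution where needed.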
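The steps above can be written compactly as three small formulas. The notation is an assumption of this sketch and follows the usual denoising-diffusion convention rather than anything stated in this review: x is the input image, x_t its noised version at timestep t, f_θ the noise-prediction network, e_tgt the embedding of the target text, e_opt the optimized embedding, and η the interpolation weight (the "eta" referred to in the research questions). The arg-min signs are approximate, since in practice each stage runs for a limited number of gradient steps.

```latex
\begin{align}
% Denoising objective shared by stages 1 and 2
\mathcal{L}(x, e, \theta) &= \mathbb{E}_{t,\epsilon}
    \left[ \lVert \epsilon - f_\theta(x_t, t, e) \rVert_2^2 \right] \\
% Stage 1: optimize the embedding, starting from e_{tgt}, with \theta frozen
e_{opt} &\approx \arg\min_{e} \; \mathcal{L}(x, e, \theta) \\
% Stage 2: fine-tune the model with the embedding fixed at e_{opt}
\theta^{*} &\approx \arg\min_{\theta} \; \mathcal{L}(x, e_{opt}, \theta) \\
% Stage 3: interpolate, then sample from f_{\theta^{*}} conditioned on \bar{e}
\bar{e} &= \eta \, e_{tgt} + (1 - \eta) \, e_{opt}
\end{align}
```

With η = 0 the conditioning is e_opt and the fine-tuned model should reproduce the input image; with η = 1 the conditioning is the plain target text embedding.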

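Experimental Results

Research Questions

RQ1: Can complex non-rigid edits be applied to a single real image using only a text prompt and that one input image?
RQ2: Can optimizing the text embedding and fine-tuning the diffusion model achieve high fidelity to the original image and alignment with the target text at the same time?
RQ3: Is linear interpolation between the image-representing embedding and the target-edit embedding semantically meaningful for editing?
RQ4: How well does Imagic perform against existing single-image editing methods on a challenging benchmark?
RQ5: How do different edit strengths (eta) affect fidelity and text alignment?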
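RQ3 and RQ5 both come down to sweeping the interpolation weight between the two embeddings. The following is a minimal sketch of that sweep, assuming embeddings are flat lists of floats (real text embeddings are sequences of token vectors) and using hypothetical helper names `interpolate_embedding` and `eta_sweep` that do not come from the paper or any library. It covers only the interpolation step, not the diffusion sampling that would consume each result.

```python
from typing import Dict, List, Sequence


def interpolate_embedding(
    e_tgt: Sequence[float], e_opt: Sequence[float], eta: float
) -> List[float]:
    """Return eta * e_tgt + (1 - eta) * e_opt, element by element.

    Hypothetical helper: eta = 0 gives the optimized embedding (input image),
    eta = 1 gives the target text embedding (full edit).
    """
    if len(e_tgt) != len(e_opt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_tgt, e_opt)]


def eta_sweep(
    e_tgt: Sequence[float], e_opt: Sequence[float], etas: Sequence[float]
) -> Dict[float, List[float]]:
    """Build one interpolated embedding per edit strength in `etas`."""
    return {eta: interpolate_embedding(e_tgt, e_opt, eta) for eta in etas}


# Toy values chosen for illustration only; they are not real embeddings.
e_opt = [0.0, 2.0, 4.0]
e_tgt = [1.0, 0.0, 4.0]
sweep = eta_sweep(e_tgt, e_opt, [0.0, 0.5, 1.0])
```

In a full pipeline, each entry of `sweep` would condition one sampling run of the fine-tuned model, and the resulting images would be compared for fidelity and text alignment.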

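Key Findings

Imagic achieves high fidelity to the input image while aligning complex edits with the target text.
The method performs edits such as pose and composition changes on real images within a single framework.
On TEdBench, human evaluators prefer Imagic over SDEdit, DDIB, and Text2LIVE for editing quality (preference rate above 70%).
The three-stage process of embedding optimization, model fine-tuning, and embedding interpolation is essential for high-quality edits.
Fine-tuning the diffusion model is important both for reconstructing the input image and for making the interpolation meaningful.
The approach is demonstrated across a variety of domains using Imagen and Stable Diffusion.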

This review was generated by AI and checked by a human editor.