QUICK REVIEW

[Paper Review] Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

Xuan Ju, Ailing Zeng | arXiv (Cornell University) | 2023-10-02

Generative Adversarial Networks and Image Synthesis | Citations: 9

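One-Line Summary

Direct Inversion decouples the source and target diffusion branches used for editing, reaches the best content preservation and edit fidelity with only three lines of code, and, as validated on PIE-Bench, runs substantially faster than optimization-based inversion.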

ABSTRACT

Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up.

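Motivation and Goals

Motivate why the inversion strategy matters in diffusion-based image editing and examine whether optimization-based inversion is actually needed.
Propose a simple, drop-in inversion method that enables faithful edits while preserving essential content.
Show that decoupling the source and target branches performs well without heavy optimization.
Provide a standardized benchmark (PIE-Bench) and robust evaluation for comparing inversion techniques.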

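Proposed Method

Decouple the source and target diffusion branches and give each a distinct role: the source branch preserves content, and the target branch delivers edit fidelity.
Add three lines of code to the forward editing process: compute the difference between the inverted source latent and the forward-generated latent, then inject it back into the source branch of the editing chain (no optimization).
Leave the target branch untouched to maximize edit fidelity.
Run in two parts: (a) invert the source image with DDIM Inversion; (b) perform the edit with Direct Inversion, propagating the source latent difference into each forward DDIM step.
Introduce PIE-Bench, a 700-image editing benchmark for standardized evaluation, covering 10 editing types with annotations (prompts, masks).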
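
The two-part procedure above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_eps` is a hypothetical stand-in for the diffusion model's noise predictor, prompts are reduced to a single float, `ALPHA_BAR` is an invented schedule, and the two branches do not share attention features the way real editing methods such as P2P do. The point is only to show where the latent difference is computed and re-injected.

```python
import math

T = 10  # number of DDIM steps (toy value)
# Hypothetical noise schedule: ALPHA_BAR[0] = 1.0 (clean latent), decreasing with t.
ALPHA_BAR = [1.0 - 0.09 * t for t in range(T + 1)]


def toy_eps(z, t, cond):
    """Stand-in for the noise-prediction network; NOT a real diffusion model."""
    return [math.tanh(v + cond) * (0.5 + 0.05 * t) for v in z]


def guided_eps(z, t, cond, scale):
    """Classifier-free guidance on top of the toy predictor (cond=0.0 is 'unconditional')."""
    e_u = toy_eps(z, t, 0.0)
    e_c = toy_eps(z, t, cond)
    return [u + scale * (c - u) for u, c in zip(e_u, e_c)]


def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to noise level a_to."""
    out = []
    for v, e in zip(z, eps):
        x0 = (v - math.sqrt(1.0 - a_from) * e) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * e)
    return out


def ddim_invert(z0, src_cond):
    """Part (a): DDIM Inversion. Returns the trajectory z*_0 .. z*_T."""
    traj = [list(z0)]
    for t in range(T):
        eps = guided_eps(traj[-1], t, src_cond, 1.0)
        traj.append(ddim_step(traj[-1], eps, ALPHA_BAR[t], ALPHA_BAR[t + 1]))
    return traj


def edit(z0, src_cond, tgt_cond, scale=7.5, direct_inversion=True):
    """Part (b): run source and target branches from the same inverted latent."""
    z_star = ddim_invert(z0, src_cond)
    z_src = z_tgt = z_star[T]
    for t in range(T, 0, -1):
        a_from, a_to = ALPHA_BAR[t], ALPHA_BAR[t - 1]
        new_src = ddim_step(z_src, guided_eps(z_src, t, src_cond, scale), a_from, a_to)
        new_tgt = ddim_step(z_tgt, guided_eps(z_tgt, t, tgt_cond, scale), a_from, a_to)
        if direct_inversion:
            # Difference between the inverted latent and the forward-generated latent ...
            offset = [s - n for s, n in zip(z_star[t - 1], new_src)]
            # ... injected back into the source branch only; the target branch is untouched.
            new_src = [n + o for n, o in zip(new_src, offset)]
        z_src, z_tgt = new_src, new_tgt
    return z_src, z_tgt
```

Calling `edit([0.3, -0.8, 1.2], 0.5, -0.5)` returns a source latent that matches the input up to floating-point rounding, while `direct_inversion=False` lets the guided source branch drift away from it. In this toy the target output is the same either way because the branches are independent; in a real editing method the target branch reads features from the source branch, which is where the correction would take effect.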

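Experimental Results

Research Questions

RQ1: Can optimization-based inversion be replaced by a simple decoupled-branch approach without sacrificing edit fidelity or content preservation?
RQ2: Does correcting only the source latent while leaving the target branch untouched improve stability and performance across editing methods?
RQ3: How much speed and accuracy can the three-line solution deliver in diffusion-based editing?
RQ4: How does a standardized benchmark (PIE-Bench) affect the fair evaluation of inversion methods?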

Key Results

Inversion method	Editing method	Structure distance (×10^3) ↓	PSNR ↑	LPIPS (×10^3) ↓	MSE (×10^4) ↓	SSIM (×10^2) ↑	Whole-image CLIPSIM ↑	Edited-region CLIPSIM ↑	Notes
DDIM	P2P	69.43	17.87	208.80	219.88	71.14	25.01	22.44	--
NT	P2P	13.44	27.03	60.67	35.86	84.11	24.75	21.86	--
NP	P2P	16.17	26.21	69.01	39.73	83.40	24.61	21.87	--
StyleD	P2P	11.65	26.05	66.10	38.63	83.42	24.78	21.72	--
Ours	P2P	11.65	27.22	54.55	--	84.76	25.02	22.10	(Direct Inversion; structure distance 83% lower than DDIM)
DDIM	MasaCtrl	28.38	22.17	106.62	86.97	79.67	23.96	21.16	--
Ours	MasaCtrl	24.70	22.64	87.94	81.09	81.33	24.38	21.35	(Direct Inversion)
DDIM	P2P-Zero	61.68	20.44	172.22	144.12	74.67	22.80	20.54	--
Ours	P2P-Zero	49.22	21.53	138.98	127.32	77.05	23.31	21.05	(Direct Inversion)
DDIM	PnP *	28.22	22.28	113.46	83.64	79.05	25.41	22.55	--
Ours	PnP *	24.29	22.46	106.06	80.45	79.68	25.41	22.62	(Direct Inversion)
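
For readers unfamiliar with the preservation columns, the following is a minimal sketch of how MSE and PSNR could be computed over the region outside an edit mask. It assumes pixel values in [0, 1], images given as nested lists, and a binary mask where 1 marks the edited region; the function names and this exact masking convention are illustrative assumptions, not the benchmark's official evaluation code.

```python
import math


def masked_mse(src, edited, edit_mask):
    """Mean squared error over pixels where edit_mask is 0 (the background)."""
    total, count = 0.0, 0
    for s_row, e_row, m_row in zip(src, edited, edit_mask):
        for s, e, m in zip(s_row, e_row, m_row):
            if m == 0:
                total += (s - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no background pixels")
    return total / count


def psnr_from_mse(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given MSE and peak pixel value."""
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

Lower MSE and higher PSNR on the background both indicate that the edit left the unmasked content closer to the source image.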

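In comparisons spanning five inversion techniques and eight editing methods, Direct Inversion gives the best overall content preservation and edit fidelity.
It achieves up to an 83.2% improvement in structure distance, up to a 73.9% improvement in background LPIPS, and up to an 8.8% increase in edited-region CLIPSIM.
The method is nearly an order of magnitude faster than optimization-based inversions (e.g., NT and StyleDiffusion).
Across the eight editing methods, Direct Inversion improves content preservation by up to 20.2% and edit fidelity by up to 2.5%.
PIE-Bench provides 700 images with 10 editing types and annotations, enabling robust, standardized comparison.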
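
The headline percentages can be sanity-checked against the P2P rows of the table. The sketch below assumes the improvement is the relative reduction from the DDIM baseline to Direct Inversion for lower-is-better metrics; the helper name is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Percent reduction of a lower-is-better metric relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


# Values taken from the DDIM + P2P and Ours + P2P rows.
structure_gain = relative_reduction(69.43, 11.65)   # structure distance (×10^3)
lpips_gain = relative_reduction(208.80, 54.55)      # LPIPS (×10^3)

print(round(structure_gain, 1))  # 83.2
print(round(lpips_gain, 1))      # 73.9
```

Both values agree with the "up to 83.2%" and "up to 73.9%" figures quoted above, which suggests those maxima come from the P2P comparison against DDIM.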


This review was generated by AI and reviewed by a human editor.