QUICK REVIEW

[Paper Review] Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady|arXiv (Cornell University)|2022. 08. 02.

Modular Robots and Swarm Intelligence | Citations: 361

One-Line Summary

This paper presents a text-only Prompt-to-Prompt framework that edits an image by editing its prompt, preserving the original structure without requiring a mask.

ABSTRACT

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence, ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image to each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.

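Motivation and Goals

Enable intuitive text-based image editing without user-provided masks or additional training.
Investigate cross-attention layers as the semantic bridge between prompt tokens and image regions.
Develop prompt-based editing operations that change or preserve image structure by manipulating attention.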

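Proposed Method

Analyze cross-attention in a text-conditioned diffusion model to identify how prompt tokens connect to spatial image regions.
To preserve layout, inject the attention maps from the source generation into the diffusion process run with the edited prompt, overriding the maps the edited prompt would produce on its own.
Define the editing operations (word swap, phrase addition, attention re-weighting) as controlled attention injection across the diffusion steps.
Perform the edit with an iterative diffusion-based algorithm that runs the source prompt and the edited prompt side by side with shared randomness (the same initial noise and seed).
Introduce mechanisms for partial token alignment and timestep-dependent injection to balance fidelity and editability.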
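
The two-branch procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_step` is a hypothetical stand-in for one denoising step of a real diffusion model (which would have many attention layers, a noise predictor, and guidance), and the names `prompt_to_prompt` and `edit_fn` are assumptions made for this sketch. What it does show is the control flow: both branches start from the same noise, the source branch produces attention maps, and the edited branch is re-run with the maps returned by `edit_fn`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def toy_step(z, prompt, t, attn_override=None):
    """Hypothetical stand-in for one denoising step (NOT a real diffusion model).

    z:      (pixels, d) latent, used as the attention queries.
    prompt: (tokens, d) token embeddings, used as both keys and values.
    t:      timestep, unused by this toy but kept to mirror a real step's signature.
    Returns the next latent and the cross-attention map of shape (pixels, tokens).
    """
    attn = softmax(z @ prompt.T / np.sqrt(z.shape[-1]))
    used = attn if attn_override is None else attn_override
    z_next = 0.9 * z + 0.1 * (used @ prompt)  # move each pixel toward its attended tokens
    return z_next, attn

def prompt_to_prompt(step, z_T, src_prompt, edit_prompt, edit_fn, num_steps):
    """Run the source and edited branches together from the same initial noise."""
    z, z_star = z_T, z_T.copy()  # shared randomness: both branches start at z_T
    for t in range(num_steps, 0, -1):  # timesteps count down from num_steps to 1
        z_next, m_src = step(z, src_prompt, t)       # source branch and its map
        _, m_edit = step(z_star, edit_prompt, t)     # map the edited prompt would use
        m_hat = edit_fn(m_src, m_edit, t)            # decide which map to apply
        z_star, _ = step(z_star, edit_prompt, t, attn_override=m_hat)
        z = z_next
    return z, z_star

# Example: swap one "word" (row 2) and always inject the source map.
rng = np.random.default_rng(0)
z_T = rng.normal(size=(16, 8))
src = rng.normal(size=(4, 8))
edit = src.copy()
edit[2] = rng.normal(size=8)
z, z_star = prompt_to_prompt(toy_step, z_T, src, edit,
                             lambda m_src, m_edit, t: m_src, num_steps=10)
```

In this toy, an `edit_fn` that always returns the source map makes both branches identical when the two prompts are identical, while an `edit_fn` that always returns the edited map reduces to ordinary generation from the edited prompt. Real edits sit between those two extremes.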

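Experimental Results

Research Questions

RQ1: How can cross-attention maps in a text-to-image diffusion model be used to control spatial layout during editing?
RQ2: Can local or global image edits be achieved by editing the prompt alone, without masks or retraining?
RQ3: Which strategies (injection timing, soft constraints, and so on) are effective for applying prompt changes while preserving the original composition?
RQ4: How well does the proposed method perform on real images via inversion, and what are its limitations?
RQ5: How does the trade-off between fidelity to the edited prompt and preservation of the source image structure vary?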

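Key Findings

Cross-attention maps tightly link pixels to prompt words, and manipulating them controls the image layout.
Injecting the source attention maps into generation with the edited prompt preserves the composition while still allowing semantic change.
Soft, partial attention injection controlled by a timestep threshold tau relaxes over-constraint and maintains editability.
The method supports fine-grained control through word swap, addition of new phrases, and attention re-weighting.
Preliminary results show that editing real images via inversion is feasible, and a mask-based correction is presented to address the reconstruction gap.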
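
The three operations listed above can each be written as a small function on attention maps of shape (pixels, tokens). The sketch below is an illustration under stated assumptions, not the authors' code: the function names, the `alignment` list format, and the example values are hypothetical. It assumes timesteps count down from T to 1, so steps with `t >= tau` are the early ones where the source map is injected.

```python
import numpy as np

def word_swap_edit(m_src, m_edit, t, tau):
    """Word swap: inject the source map early (t >= tau), then release it.

    Early steps fix the layout; later steps use the edited prompt's own map
    so the new word can change shape and appearance.
    """
    return m_edit if t < tau else m_src

def refinement_edit(m_src, m_edit, alignment):
    """Phrase addition: reuse source maps only for tokens both prompts share.

    alignment[j] is the source-prompt index of edited-prompt token j,
    or None when token j is new (hypothetical format for this sketch).
    """
    out = m_edit.copy()
    for j, src_j in enumerate(alignment):
        if src_j is not None:
            out[:, j] = m_src[:, src_j]  # shared token: keep the source layout
    return out

def reweight_edit(m_src, token_index, scale):
    """Attention re-weighting: scale one token's map, leave the rest unchanged."""
    out = m_src.copy()
    out[:, token_index] *= scale
    return out

# Example with random row-stochastic maps standing in for real attention maps.
rng = np.random.default_rng(0)
m_src = rng.dirichlet(np.ones(4), size=16)    # 16 pixels, 4 source tokens
m_edit = rng.dirichlet(np.ones(5), size=16)   # 16 pixels, 5 edited tokens
refined = refinement_edit(m_src, m_edit, [0, 1, None, 2, 3])  # token 2 is new
stronger = reweight_edit(m_src, token_index=1, scale=2.0)
```

A lower `tau` in `word_swap_edit` keeps the source map for more steps and so holds the composition more tightly, while a higher `tau` gives the edited prompt more freedom. Note that `reweight_edit` does not renormalize rows, which is a choice made for this sketch.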

This review was generated by AI and reviewed by a human editor.