QUICK REVIEW

[Paper Review] Directed Diffusion: Direct Control of Object Placement through Attention Guidance

Wan-Duo Kurt Ma, J. P. Lewis | arXiv (Cornell University) | 2023. 02. 25.

Generative Adversarial Networks and Image Synthesis | Citations: 10

One-Line Summary

Directed Diffusion adds coarse positional control over directed objects to text-guided diffusion by editing cross-attention maps with user-provided bounding boxes, and it works without any model training.

ABSTRACT

Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines to implement.

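Motivation and Goals

Enable explicit object placement in diffusion-generated scenes to support storytelling and compositional intent.
Provide a simple, training-free method for controlling the positions of multiple objects within a single image or across related images.
Maintain consistency between the placed objects and the background throughout the diffusion process.
Provide a lightweight implementation that integrates with pre-trained diffusion models.
Support open-set, zero-shot placement without dedicated retraining.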

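Proposed Method

Use the cross-attention maps of a pre-trained diffusion model to associate prompt words with spatial regions.
Define bounding boxes and directed prompt words to guide object positions early in the denoising process.
Introduce an attention-editing step that modulates the trailing cross-attention maps with Gaussian weights inside the bounding box.
Optimize a small weight vector a to align the directed attention maps with a Gaussian target map while preserving the learned text-image mapping.
Use a two-stage pipeline: attention editing during the early denoising steps, followed by conventional diffusion denoising with classifier-free guidance.
Enable placement and interaction control with minimal code changes and no model fine-tuning.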
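
The two-stage idea above can be illustrated with a small, library-free sketch. It is not the authors' implementation: the function names, the `sigma_scale`, `strength`, `damp`, and `edit_fraction` parameters, and the rule of damping attention outside the box are all assumptions chosen for illustration, and the optimization of the weight vector a is omitted. Only the classifier-free guidance formula is the standard one.

```python
import math

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Build a height x width map: a Gaussian bump inside `box`, zero outside.

    `box` is (top, left, bottom, right) in cell indices; bottom/right are
    exclusive. `sigma_scale` is a hypothetical knob, not from the paper.
    """
    top, left, bottom, right = box
    cy = (top + bottom - 1) / 2.0
    cx = (left + right - 1) / 2.0
    sy = max((bottom - top) * sigma_scale, 1e-6)
    sx = max((right - left) * sigma_scale, 1e-6)
    mask = [[0.0] * width for _ in range(height)]
    for y in range(top, bottom):
        for x in range(left, right):
            dy = (y - cy) / sy
            dx = (x - cx) / sx
            mask[y][x] = math.exp(-0.5 * (dy * dy + dx * dx))
    return mask

def edit_attention(attn, mask, strength=1.0, damp=0.1):
    """Boost attention inside the box and damp it outside (assumed rule)."""
    return [
        [a * (1.0 + strength * m) if m > 0.0 else a * damp
         for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(attn, mask)
    ]

def should_edit(step, total_steps, edit_fraction=0.2):
    """Stage 1 (attention editing) covers only the earliest denoising steps."""
    return step < int(total_steps * edit_fraction)

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG combination, used here for the second stage."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy usage: a uniform 8x8 attention map edited toward a central box.
attn = [[1.0] * 8 for _ in range(8)]
mask = gaussian_box_mask(8, 8, box=(2, 2, 6, 6))
edited = edit_attention(attn, mask)
```

In a real pipeline the edited map would replace the cross-attention map for the directed token while `should_edit` is true, after which denoising proceeds unmodified.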

Figure 1: Directed Diffusion (DD) augments denoising diffusion text-to-image generation by allowing the position of specified objects to be controlled with user-specified bounding boxes (highlighted in red). (Left) DD generates specified objects (insect robot, cat) placed according to the given boun

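Experimental Results

Research Questions

RQ1: Can coarse bounding-box direction steer the placement of specific objects in a pre-trained diffusion model without retraining?
RQ2: How well does attention-based placement handle multiple directed objects and their interactions with the scene?
RQ3: Does the method maintain contextual consistency (lighting, shadows, etc.) between the placed objects and the background?
RQ4: How does Directed Diffusion compare with existing open-set placement methods in ease of use and quality?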

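Key Findings

The method provides easy, high-level positional control over multiple objects without fine-tuning the pre-trained model.
Placed objects blend consistently with the background and show contextual interactions such as shadows.
It uses a small optimization of a weight vector over the directed cross-attention maps so that they match a Gaussian target map inside the bounding box.
The pipeline takes only a few lines of code to implement and preserves the pre-trained model's text-image alignment.
It enables placement and interaction of directed objects across different scenes or frames, which improves compositionality and storytelling capability.
In the paper's experiments, CLIP-based similarity between prompts and synthesized images is reported to be comparable to or better than that of competing methods.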
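
For readers unfamiliar with the metric in the last finding, CLIP-based similarity is typically the cosine similarity between a text embedding and an image embedding. The sketch below shows only that final step on placeholder vectors; the embeddings are made up for illustration and are not CLIP outputs, and the paper's exact evaluation protocol is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)

# Placeholder embeddings (hypothetical values, not produced by CLIP).
text_embedding = [0.2, 0.7, 0.1]
image_embedding = [0.25, 0.6, 0.05]
score = cosine_similarity(text_embedding, image_embedding)
```

A higher score means the image embedding points in a direction closer to the prompt embedding.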

Figure 2: (Top, from left to right): The reverse SD denoising process from the initial stage to the end of process. Note that the position of the cat is evident early in the process (red box), however the details that define it as a cat are not yet clear. (Bottom): The cross-attention maps associate

This review was generated by AI and reviewed by a human editor.