QUICK REVIEW

[Paper Review] Cones 2: Customizable Image Synthesis with Multiple Subjects

Zhiheng Liu, Yifei Zhang | arXiv (Cornell University) | 2023. 05. 30.

Topic: Generative Adversarial Networks and Image Synthesis | Citations: 8

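One-Line Summary

Cones 2 combines per-subject residual token embeddings with layout-guided diffusion to efficiently synthesize images containing multiple user-specified subjects without retraining the model, achieving strong performance and scalability.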

ABSTRACT

Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

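Motivation and Goals

Motivates the need for customizable multi-subject image synthesis in real-world applications.
Proposes an efficient subject representation: a residual token embedding learned on top of the base text embedding.
Introduces layout-based spatial guidance to control subject placement and reduce interference between subjects.
Develops a training objective that learns a per-subject residual while preserving the original text embeddings.
Demonstrates scalability to six subjects and performance that is competitive with or better than state-of-the-art baselines.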

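Proposed Method

Represents each subject as a residual token embedding, Delta_custom, that shifts the base embedding toward the customized subject.
Trains a subject-specific text encoder with a subject-preservation loss and a text-embedding-preservation regularizer, so that the residual shift stays localized to the subject token.
Computes Delta_custom for each subject as the mean embedding difference over multiple captions that contain the subject.
At inference, composes multiple subjects by adding each Delta_custom vector to the corresponding subject token in the input embedding (no model retraining).
Uses a layout prior to rectify cross-attention activations, strengthening each subject's target region and weakening irrelevant regions.
Edits the cross-attention maps with layout-guided masking during sampling to control subject positions across timesteps.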
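
The residual-embedding steps above can be pictured with a small toy sketch. It assumes the "embedding difference" is taken between the subject-specific encoder and the base encoder at the subject token, which this review does not spell out. The function names (`mean_residual`, `compose`) and the 2-dimensional vectors are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_residual(tuned: Sequence[Vector], base: Sequence[Vector]) -> Vector:
    """Average, over captions, the difference between the tuned and base
    embeddings of the subject token (plays the role of Delta_custom)."""
    if not tuned or len(tuned) != len(base):
        raise ValueError("need the same non-zero number of tuned and base embeddings")
    dim = len(base[0])
    return [
        sum(t[i] - b[i] for t, b in zip(tuned, base)) / len(base)
        for i in range(dim)
    ]


def compose(tokens: Sequence[str],
            embeddings: Sequence[Vector],
            residuals: Dict[str, Vector]) -> List[Vector]:
    """Add each subject's residual at that subject's token position.
    All other token embeddings are returned unchanged."""
    out = []
    for tok, emb in zip(tokens, embeddings):
        delta = residuals.get(tok)
        out.append(list(emb) if delta is None else [e + d for e, d in zip(emb, delta)])
    return out


# Toy values (assumptions for illustration, not data from the paper).
base_dog = [[1.0, 0.0], [1.0, 2.0]]    # base encoder, "dog" token in two captions
tuned_dog = [[1.5, 1.0], [2.5, 2.0]]   # subject-specific encoder, same captions
delta_dog = mean_residual(tuned_dog, base_dog)
delta_cat = [-1.0, 0.0]                # a second, independently learned residual

tokens = ["a", "dog", "and", "cat"]
prompt_embeddings = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
composed = compose(tokens, prompt_embeddings, {"dog": delta_dog, "cat": delta_cat})
```

Because each subject contributes only one vector that is added at its own token, adding another subject means adding one more entry to the `residuals` dictionary; nothing about the other subjects or the diffusion model has to be recomputed.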

Experimental Results

Research Questions

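RQ1: How can multiple user-specified subjects be represented and combined efficiently without retraining the diffusion model?
RQ2: Can a simple residual embedding on top of the base text embedding support reliable multi-subject customization and composition?
RQ3: Does introducing a layout prior into cross-attention improve subject placement and reduce attribute interference?
RQ4: Does the proposed approach scale to more subjects and handle semantically similar subjects?
RQ5: How does it compare with state-of-the-art baselines in text alignment, image similarity, and efficiency?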

Key Results

Subjects	Method	Text Alignment	Image Alignment	Storage	Complexity
1	DreamBooth	0.314	0.727	3.3 GB	O(n)
1	Custom Diffusion	0.327	0.721	72 MB	O(n)
1	Cones	0.331	0.722	(1.43 ± 0.34) MB	O(n)
1	Cones 2 (ours)	0.330	0.725	4.8 KB	O(n)
2	DreamBooth	0.278	0.664	3.3 GB	O(n^2)
2	Custom Diffusion	0.284	0.676	72 MB	O(n^2)
2	Cones	0.292	0.685	(3.41 ± 0.56) MB	O(n^2)
2	Cones 2 (ours)	0.309	0.708	9.6 KB	O(n)
3	DreamBooth	0.252	0.649	3.3 GB	O(n^3)
3	Custom Diffusion	0.270	0.658	72 MB	O(n^3)
3	Cones	0.281	0.663	(4.96 ± 0.70) MB	O(n^3)
3	Cones 2 (ours)	0.304	0.689	14.4 KB	O(n)
4	DreamBooth	0.241	0.604	3.3 GB	O(n^4)
4	Custom Diffusion	0.254	0.623	72 MB	O(n^4)
4	Cones	0.271	0.638	(7.75 ± 0.56) MB	O(n^4)
4	Cones 2 (ours)	0.299	0.673	19.2 KB	O(n)
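
The storage figures for the proposed method in the table grow linearly with the number of subjects, at 4.8 KB per subject. The short check below reproduces that arithmetic. The comparison with DreamBooth assumes decimal units (1 GB = 10^6 KB), which the table does not state, so the ratio should be read as an order-of-magnitude estimate.

```python
PER_SUBJECT_KB = 4.8  # storage reported in the table for one subject


def residual_storage_kb(num_subjects: int) -> float:
    """Storage if each subject adds one fixed-size residual embedding."""
    return round(PER_SUBJECT_KB * num_subjects, 1)


# Values reported in the table for one to four subjects.
reported_kb = {1: 4.8, 2: 9.6, 3: 14.4, 4: 19.2}
matches = {n: residual_storage_kb(n) == kb for n, kb in reported_kb.items()}

# Assumption: decimal units, so 3.3 GB = 3.3e6 KB.
dreambooth_kb = 3.3e6
ratio_at_four_subjects = dreambooth_kb / reported_kb[4]  # roughly 170,000x
```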

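The residual token embedding approach allows multiple subjects to be composed flexibly without retraining the diffusion model.
The text-embedding preservation loss localizes customization to the subject token, which makes multi-subject combinations robust.
Layout-guided cross-attention rectification improves subject placement and reduces interference between subjects.
Across multiple subjects, the method matches or outperforms DreamBooth, Custom Diffusion, and Cones, and it is demonstrated with up to six subjects in challenging scenarios.
For two to four subjects, the proposed method has the best text and image alignment in the table; for a single subject it is within 0.002 of the best baseline. It does so with 4.8 KB of storage per subject and lower training complexity.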
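
The layout-guided cross-attention rectification summarized above can be illustrated with a toy sketch. It assumes the rectification is an additive bias on the pre-softmax attention scores: positive inside a subject's layout box and negative outside. The review only says that target regions are strengthened and other regions weakened, so the exact form, the `strength` parameter, and the function names here are assumptions rather than the authors' formulation.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's token scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layout_guided_attention(logits: List[List[float]],
                            token_boxes: Dict[int, Box],
                            height: int,
                            width: int,
                            strength: float) -> List[List[float]]:
    """logits[p][k] is the pre-softmax score of pixel p (row-major) for token k.
    For each subject token, raise its score inside its box and lower it outside,
    then normalize over tokens for every pixel."""
    edited = [list(row) for row in logits]
    for k, (top, left, bottom, right) in token_boxes.items():
        for p in range(height * width):
            r, c = divmod(p, width)
            inside = top <= r < bottom and left <= c < right
            edited[p][k] += strength if inside else -strength
    return [softmax(row) for row in edited]


# Toy 1x2 "image" with three tokens: 0 = ordinary word, 1 = subject A, 2 = subject B.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
boxes = {1: (0, 0, 1, 1),   # subject A assigned to the left pixel
         2: (0, 1, 1, 2)}   # subject B assigned to the right pixel
attn = layout_guided_attention(logits, boxes, height=1, width=2, strength=2.0)
```

In this toy case the left pixel attends mostly to subject A and the right pixel mostly to subject B, which is the separation effect the layout prior is meant to produce. A real sampler would apply such an edit at every denoising step, possibly with a strength that depends on the timestep.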

This review was generated by AI and checked by a human editor.