QUICK REVIEW

[Paper Review] DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Hong Chen, Yipeng Zhang | arXiv (Cornell University) | 2023-05-05

Video Analysis and Summarization | Citations: 11

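One-Line Summary

DisenBooth introduces identity-preserving disentangled tuning for subject-driven T2I generation: it separates a textual identity-preserved embedding from a visual identity-irrelevant embedding and uses auxiliary objectives to improve both identity preservation and prompt fidelity.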

ABSTRACT

Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability

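Motivation and Objectives

Improve subject-driven text-to-image generation by addressing the entanglement between subject identity and background or pose.
Propose a disentangled tuning framework that preserves subject identity and captures identity-irrelevant information separately.
Develop auxiliary objectives that enforce disentanglement while the diffusion model is fine-tuned.
Achieve parameter-efficient fine-tuning with adapters and LoRA.
Show improved generation quality and controllability over baseline methods.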

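Proposed Method

Starts from a pretrained diffusion model and fine-tunes it in the denoising process.
Extracts the identity-preserved text embedding f_s from a special prompt P_s via the CLIP text encoder.
Extracts a per-image identity-irrelevant visual embedding f_i with a CLIP image encoder augmented by an adapter.
Optimizes the joint loss L = L1 + L2 + L3, where L1 is denoising conditioned on f_s + f_i, L2 is weak denoising conditioned on f_s alone, and L3 is a contrastive objective between the two embeddings that encourages disentanglement.
Applies LoRA-based parameter-efficient fine-tuning to the U-Net, alongside the adapter, to reduce the number of trainable parameters.
At generation time, combines f_s with the text prompt to produce subject-driven outputs and, when needed, mixes in η f_i to carry over characteristics of a reference image.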
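
The three loss terms and the optional η f_i mixing described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the U-Net noise predictor, all vectors share one toy dimension, the same noisy latent is reused for both denoising terms, and the contrastive term is modeled as a cosine similarity between f_s and f_i, which is one plausible reading of "contrastive objective between the two embeddings". The function names are invented for this example.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def toy_denoiser(z_t, cond):
    """Hypothetical stand-in for the U-Net: predicts noise from a latent and a condition."""
    return 0.5 * z_t + 0.1 * cond


def disentangled_loss(z_t, noise, f_s, f_i, denoiser=toy_denoiser):
    """Return L = L1 + L2 + L3 and the individual terms (illustrative only)."""
    # L1: denoising conditioned on the combined embedding f_s + f_i.
    l1 = float(np.mean((noise - denoiser(z_t, f_s + f_i)) ** 2))
    # L2: weak denoising conditioned on the identity-preserved embedding alone.
    l2 = float(np.mean((noise - denoiser(z_t, f_s)) ** 2))
    # L3 (assumption): push f_s and f_i apart by penalizing their cosine similarity.
    l3 = cosine_similarity(f_s, f_i)
    return l1 + l2 + l3, (l1, l2, l3)


def inference_condition(f_prompt, f_i=None, eta=0.0):
    """Condition used at generation time: prompt embedding, optionally plus eta * f_i."""
    if f_i is None:
        return f_prompt
    return f_prompt + eta * f_i


rng = np.random.default_rng(0)
z_t, noise, f_s, f_i = (rng.normal(size=8) for _ in range(4))
total, (l1, l2, l3) = disentangled_loss(z_t, noise, f_s, f_i)
print(f"L1={l1:.3f}  L2={l2:.3f}  L3={l3:.3f}  L={total:.3f}")
```

Setting `eta=0.0` (or omitting `f_i`) conditions generation on the text side only, while a larger `eta` lets more of the reference image's identity-irrelevant characteristics through.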

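Experimental Results

Research Questions

RQ1: Can diffusion-based T2I generation preserve subject identity while still allowing flexible text-driven customization?
RQ2: Does separating disentangled identity-preserved textual information from identity-irrelevant visual information help prompt fidelity and identity retention?
RQ3: Can parameter-efficient fine-tuning (LoRA/adapters) deliver competitive results when combined with disentangled embeddings?
RQ4: How do the proposed weak denoising and contrastive embedding objectives affect disentanglement and generation quality?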

Key Results

DINO Score	CLIP-T Score	User Avg. Rank
0.675	0.330	1.589
0.362	0.352	-
0.605	0.303	2.893
0.546	0.318	3.072
0.685	0.319	2.445
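
For readers unfamiliar with the two automatic metrics in the table, the sketch below shows how they are usually computed in DreamBench-style evaluation: the DINO score as the average pairwise cosine similarity between DINO embeddings of generated and real subject images, and CLIP-T as the average cosine similarity between CLIP image embeddings of generated images and the CLIP text embedding of the prompt. The review itself does not spell out these definitions, so treat this as an assumption about the evaluation protocol. Embeddings are assumed to be precomputed, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mean_pairwise_cosine(gen_embs, ref_embs):
    """DINO-style score: mean cosine similarity over all (generated, reference) pairs."""
    g = l2_normalize(gen_embs)  # shape (n_generated, d)
    r = l2_normalize(ref_embs)  # shape (n_reference, d)
    return float((g @ r.T).mean())


def mean_text_image_cosine(image_embs, text_emb):
    """CLIP-T-style score: mean cosine similarity between images and one prompt embedding."""
    g = l2_normalize(image_embs)  # shape (n_generated, d)
    t = l2_normalize(text_emb)    # shape (d,)
    return float((g @ t).mean())


# Toy embeddings only; real scores need DINO and CLIP encoders.
generated = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
print(round(mean_pairwise_cosine(generated, reference), 3))  # 0.5
```

Higher is better for both scores: the first rewards resemblance to the real subject, the second rewards agreement with the text prompt.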

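Compared with the baselines, DisenBooth keeps both text-prompt fidelity (CLIP-T) and subject identity (DINO) high at the same time.
DisenBooth outperforms TI, DreamBooth, and InstructPix2Pix in the subjective user ranking.
The analysis confirms that f_s captures identity while f_i captures background and pose characteristics, which enables flexible inheritance of reference characteristics.
Combining f_s with η f_i allows controllable transfer of reference characteristics without overfitting to the background.
Fine-tuning touches only about 2.9M parameters (LoRA + adapter), which is efficient compared with tuning the full U-Net.
In the DreamBench experiments, DisenBooth shows the best overall performance on subject-driven generation among the compared methods.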
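
To see why a LoRA-based setup trains so few parameters, the sketch below wraps a frozen weight matrix with a trainable low-rank update and counts the trainable entries. It illustrates the general LoRA idea only: the layer size, rank, and scaling are hypothetical and are not meant to reproduce the 2.9M figure reported above.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the low-rank branch is added to the frozen output.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_params(self):
        return self.A.size + self.B.size


# Hypothetical 320 x 320 projection with rank 4.
W = np.zeros((320, 320))
layer = LoRALinear(W, rank=4)
print(layer.trainable_params(), W.size)  # 2560 102400
```

Because B starts at zero, the wrapped layer initially behaves exactly like the pretrained one, and only the r * (d_in + d_out) entries of A and B are updated during tuning.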

This review was generated by AI and reviewed by a human editor.