QUICK REVIEW

[Paper Review] HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

Nataniel Ruiz, Yuanzhen Li | arXiv (Cornell University) | July 13, 2023

Generative Adversarial Networks and Image Synthesis | Citations: 14

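One-Line Summary

HyperDreamBooth uses a hypernetwork to predict lightweight, low-rank personalization weights (LiDB) for a diffusion model. This enables subject-specific text-to-image (T2I) personalization from a single image in roughly 20 seconds, about 25x faster than DreamBooth, with a personalized model roughly 10,000x smaller.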

ABSTRACT

Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high-fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth - a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Also our method yields a model that is 10,000x smaller than a normal DreamBooth model. Project page: https://hyperdreambooth.github.io

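Motivation and Objectives

Enable fast, memory-efficient personalization of text-to-image models without compromising subject fidelity or style diversity.
Introduce Lightweight DreamBooth (LiDB) to drastically reduce the size of personalized models.
Develop a hypernetwork that predicts LiDB weights from a single subject image.
Propose rank-relaxed fast fine-tuning to enhance subject details after HyperNetwork initialization.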

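Proposed Method

Introduces Lightweight DreamBooth (LiDB), a personalization weight space of about 30K parameters (~120 KB) built from a random orthogonal incomplete basis inside the low-rank LoRA space.
Presents a HyperNetwork architecture, consisting of a ViT encoder and a transformer decoder, that iteratively predicts LiDB weight residuals from a single face image.
Trains the hypernetwork on domain-specific images with a weight-space loss and a diffusion reconstruction loss, using the simple supervision prompt "a [V] face".
Refines the initialization by predicting weight residuals iteratively; the image encoding is computed on the first pass and then held fixed, which speeds up both training and inference.
Applies rank-relaxed fine-tuning, which raises the LoRA rank during fast fine-tuning so that high-frequency subject details can be captured.
Demonstrates fast personalization on Stable Diffusion v1.5 by predicting residuals for the cross- and self-attention layers and for the CLIP text encoder.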
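The LiDB idea in the list above can be illustrated with a small sketch. It assumes the weight residual of one layer is factored into frozen random orthonormal bases (`A_aux`, `B_aux`) and small trainable factors (`A_train`, `B_train`). These names, the toy dimensions, and the Gram-Schmidt construction are assumptions made for illustration; this review does not confirm them as the authors' implementation.

```python
import random

def orthonormal_columns(n, k, rng):
    """Return an n x k matrix (list of rows) with orthonormal columns (Gram-Schmidt)."""
    cols = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in cols:  # remove the components along earlier columns
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = sum(vi * vi for vi in v) ** 0.5
        cols.append([vi / norm for vi in v])
    return [[cols[j][i] for j in range(k)] for i in range(n)]

def matmul(x, y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    y_cols = list(zip(*y))
    return [[sum(p * q for p, q in zip(row, col)) for col in y_cols] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

rng = random.Random(0)
n, m = 32, 24      # toy layer shape (assumption)
a, b, r = 8, 4, 1  # toy basis sizes and trainable rank (assumption)

# Frozen random bases: generated once and never updated per subject.
# They are "incomplete" because a < n and b < m.
A_aux = orthonormal_columns(n, a, rng)             # n x a
B_aux = transpose(orthonormal_columns(m, b, rng))  # b x m

# Trainable factors: the only per-subject values that need to be stored.
A_train = [[rng.gauss(0.0, 0.1) for _ in range(r)] for _ in range(a)]  # a x r
B_train = [[rng.gauss(0.0, 0.1) for _ in range(b)] for _ in range(r)]  # r x b

# Weight residual that would be added to the frozen layer weight.
delta_W = matmul(matmul(A_aux, A_train), matmul(B_train, B_aux))  # n x m

trainable = a * r + r * b
full = n * m
print(f"trainable per layer: {trainable}, full layer: {full}")
```

With these toy sizes only 12 values per layer are trainable, against 768 for the full layer. Summed over all personalized layers, a reduction of this kind is what makes a personalized model of roughly 120 KB plausible.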

Figure 1: Using only a single input image, HyperDreamBooth is able to personalize a text-to-image diffusion model 25x faster than DreamBooth [25], by using (1) a HyperNetwork to generate an initial prediction of a subset of network weights that are then (2) refined using fast finetuning for high
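The iterative weight prediction described in the method can be sketched as a loop in which the image is encoded once and a decoder repeatedly proposes a residual on top of the current weight estimate. Both functions below are hypothetical stand-ins: a real hypernetwork is learned, whereas this toy decoder simply moves halfway toward the features. The zero initial estimate and the iteration count are also assumptions.

```python
def encode_image(image):
    """Stand-in for the ViT encoder; counts how often it is called."""
    encode_image.calls += 1
    return [float(p) for p in image]  # toy "features"

encode_image.calls = 0

def decode_residual(features, weights):
    """Stand-in for the transformer decoder: proposes a weight residual
    from the image features and the current weight estimate."""
    return [0.5 * (f - w) for f, w in zip(features, weights)]

def predict_weights(image, num_iterations=4):
    features = encode_image(image)   # computed on the first pass only
    weights = [0.0] * len(features)  # assumption: start from a zero estimate
    for _ in range(num_iterations):
        residual = decode_residual(features, weights)
        weights = [w + d for w, d in zip(weights, residual)]
    return weights

weights = predict_weights([1.0, -2.0, 0.5])
print(weights, "encoder calls:", encode_image.calls)
```

The point of the sketch is the control flow: each pass refines the previous estimate, and reusing the cached encoding means extra iterations cost only decoder passes. The predicted weights would then serve as the starting point for fast fine-tuning.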

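Experimental Results

Research Questions

RQ1: Can a hypernetwork predict compact personalization weights that enable high-fidelity subject personalization in a diffusion model from a single image?
RQ2: How does LiDB compare with DreamBooth and Textual Inversion in size, speed, and fidelity?
RQ3: Does rank-relaxed fine-tuning enable higher subject fidelity without sacrificing speed?
RQ4: Is the approach robust across diverse subjects and style prompts?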

Key Results

Method	Face Rec.	DINO	CLIP-I	CLIP-T
Ours	0.655	0.473	0.577	0.286
DreamBooth	0.618	0.441	0.546	0.282
DreamBooth-Agg-1	0.615	0.323	0.431	0.313
DreamBooth-Agg-2	0.616	0.360	0.467	0.302
Textual Inversion	0.623	0.289	0.472	0.277
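As a quick way to read the table, the following sketch finds the best-scoring method per metric. It copies the numbers from the rows above and assumes that higher is better for all four metrics, which is how the review's summary treats them.

```python
# Scores copied from the table above (method -> metric -> value).
scores = {
    "Ours":              {"Face Rec.": 0.655, "DINO": 0.473, "CLIP-I": 0.577, "CLIP-T": 0.286},
    "DreamBooth":        {"Face Rec.": 0.618, "DINO": 0.441, "CLIP-I": 0.546, "CLIP-T": 0.282},
    "DreamBooth-Agg-1":  {"Face Rec.": 0.615, "DINO": 0.323, "CLIP-I": 0.431, "CLIP-T": 0.313},
    "DreamBooth-Agg-2":  {"Face Rec.": 0.616, "DINO": 0.360, "CLIP-I": 0.467, "CLIP-T": 0.302},
    "Textual Inversion": {"Face Rec.": 0.623, "DINO": 0.289, "CLIP-I": 0.472, "CLIP-T": 0.277},
}
metrics = ["Face Rec.", "DINO", "CLIP-I", "CLIP-T"]

# Assumption: higher is better for every metric.
best = {
    metric: max(scores, key=lambda name: scores[name][metric])
    for metric in metrics
}

for metric in metrics:
    winner = best[metric]
    print(f"{metric}: {winner} ({scores[winner][metric]:.3f})")
```

On these numbers "Ours" leads on Face Rec., DINO, and CLIP-I, while DreamBooth-Agg-1 has the highest CLIP-T.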

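HyperDreamBooth personalizes a subject in about 20 seconds, roughly 25x faster than DreamBooth and roughly 125x faster than Textual Inversion.
The LiDB model is about 10,000x smaller than a standard DreamBooth model (~120 KB, ~30K trainable variables).
HyperNetwork-based initialization followed by fast fine-tuning delivers subject fidelity and style diversity comparable to DreamBooth.
Rank-relaxed fine-tuning temporarily raises the LoRA rank, which improves detail capture and yields high subject fidelity while keeping the runtime short.
In the reported experiments, HyperDreamBooth scores higher than both DreamBooth and Textual Inversion on Face Rec., DINO, CLIP-I, and CLIP-T; only the aggressive DreamBooth variants (Agg-1, Agg-2) score higher on CLIP-T.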
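The speed and size figures can be cross-checked with simple arithmetic. The sketch below derives the implied baseline numbers from the ratios quoted in this review. It assumes 4 bytes per parameter (float32) and 1 KB = 1000 bytes; the derived values are rough implications of the quoted ratios, not measurements reported here.

```python
personalization_seconds = 20
speedup_vs_dreambooth = 25
speedup_vs_textual_inversion = 125

# Implied baseline personalization times.
dreambooth_seconds = personalization_seconds * speedup_vs_dreambooth
textual_inversion_seconds = personalization_seconds * speedup_vs_textual_inversion

lidb_params = 30_000
bytes_per_param = 4  # assumption: float32 storage
lidb_bytes = lidb_params * bytes_per_param

size_ratio = 10_000
dreambooth_bytes = lidb_bytes * size_ratio  # implied full-model size

print(f"DreamBooth: ~{dreambooth_seconds / 60:.1f} min")
print(f"Textual Inversion: ~{textual_inversion_seconds / 60:.1f} min")
print(f"LiDB: {lidb_bytes / 1000:.0f} KB")
print(f"Implied DreamBooth model: {dreambooth_bytes / 1e9:.1f} GB")
```

Under these assumptions, 30K float32 parameters come to 120 KB, which matches the quoted size, and the implied baselines are on the order of minutes of training and a model of about a gigabyte.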

Figure 2: HyperDreamBooth Training and Fast Fine-Tuning. Phase-1: Training a hypernetwork to predict network weights from a face image, such that a text-to-image diffusion network outputs the person's face from the sentence "a [v] face" if the predicted weights are applied to it. We use pre-compute

This review was generated by AI and reviewed by a human editor.