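[Paper Walkthrough] HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
HyperDreamBooth uses a hypernetwork to predict lightweight, low-rank personalized weights (LiDB) for a diffusion model. This enables subject-specific text-to-image personalization from a single image in roughly 20 seconds, about 25x faster than DreamBooth, with a personalized model roughly 10,000x smaller.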
Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high-fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth - a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Also our method yields a model that is 10,000x smaller than a normal DreamBooth model. Project page: https://hyperdreambooth.github.io
Motivation and Goals
- Motivate fast, memory-efficient personalization of text-to-image models without compromising subject fidelity or style diversity.
- Introduce Lightweight DreamBooth (LiDB) to drastically reduce personalized model size.
- Develop a HyperNetwork that predicts LiDB weights from a single subject image.
- Propose rank-relaxed fast finetuning to boost subject details after HyperNetwork initialization.
Proposed Method
- Introduce Lightweight DreamBooth (LiDB) with a 30K-parameter, ~120 KB personalized weight space created via a random orthogonal incomplete basis inside a low-rank LoRA space.
- Present a HyperNetwork architecture with a ViT encoder and a transformer decoder that iteratively predicts LiDB weight residuals from a single face image.
- Train the HyperNetwork with a weight-space loss and a diffusion reconstruction loss on domain-specific images, using a simple supervisory prompt “a [V] face.”
- Perform iterative predictions of weight residuals to refine initialization, with the image encoding fixed after the first pass to speed up training and inference.
- Apply rank-relaxed finetuning by increasing LoRA rank during fast fine-tuning to capture high-frequency subject details.
- Demonstrate fast personalization on Stable Diffusion v1.5 by predicting weight residuals for the cross- and self-attention layers of the diffusion model and for the CLIP text encoder.
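The LiDB idea in the first bullet can be illustrated with a toy layer. The sketch below is a minimal, assumption-laden reading of "a random orthogonal incomplete basis inside a low-rank LoRA space": each LoRA factor is split into a frozen random orthonormal matrix and a small trainable matrix. The names `A_aux`, `A_train`, `B_train`, `B_aux`, the toy dimensions, and the float32 storage assumption are illustrative and are not taken from this summary.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def random_orthonormal_rows(k, dim, rng):
    """Draw k orthonormal vectors of length dim (k <= dim) with Gram-Schmidt.

    With k < dim the vectors span only part of the space, i.e. an
    incomplete orthogonal basis.
    """
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:
            proj = sum(p * q for p, q in zip(v, u))
            v = [p - proj * q for p, q in zip(v, u)]
        norm = sum(p * p for p in v) ** 0.5
        if norm > 1e-8:
            rows.append([p / norm for p in v])
    return rows

def make_lidb_layer(n, m, a, b, r, rng):
    """Build one toy layer: frozen aux matrices plus small trainable ones."""
    return {
        "A_aux": transpose(random_orthonormal_rows(a, n, rng)),   # n x a, frozen
        "A_train": [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(a)],  # a x r
        "B_train": [[rng.gauss(0.0, 0.02) for _ in range(b)] for _ in range(r)],  # r x b
        "B_aux": random_orthonormal_rows(b, m, rng),               # b x m, frozen
    }

def weight_residual(layer):
    """Residual added to an n x m weight: (A_aux A_train)(B_train B_aux)."""
    down = matmul(layer["A_aux"], layer["A_train"])   # n x r
    up = matmul(layer["B_train"], layer["B_aux"])     # r x m
    return matmul(down, up)

def trainable_count(layer):
    """Only the *_train matrices are personalized per subject."""
    return sum(len(row) for key in ("A_train", "B_train") for row in layer[key])

rng = random.Random(0)
layer = make_lidb_layer(n=16, m=12, a=4, b=3, r=1, rng=rng)
delta_w = weight_residual(layer)

# Size check for the figures quoted above, assuming 4-byte float32 storage.
size_kb = 30_000 * 4 / 1000
```

In this toy layer a full 16 x 12 residual is controlled by only `a*r + r*b = 7` trainable numbers. The same bookkeeping shows that the quoted ~30K variables and ~120 KB are consistent with each other if each variable is stored as a 4-byte float.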
![Figure 1 : Using only a single input image, HyperDreamBooth is able to personalize a text-to-image diffusion model 25x faster than DreamBooth [ 25 ] , by using (1) a HyperNetwork to generate an initial prediction of a subset of network weights that are then (2) refined using fast finetuning for high](https://ar5iv.labs.arxiv.org/html/2307.06949/assets/x1.png)
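The iterative prediction step described in the method list (encode the image once, then repeatedly predict weight residuals) can be sketched as a simple loop. This is a minimal sketch: `toy_encoder` and `toy_decoder` are hypothetical stand-ins for the ViT encoder and transformer decoder, and the zero initialization of the weights is an assumption.

```python
def predict_weights(image, num_iters, encoder, decoder, num_weights):
    """Iteratively refine a weight estimate from a single image.

    The image is encoded once; each pass feeds the current estimate back
    to the decoder, which returns a residual that is added to it.
    """
    features = encoder(image)          # computed once, reused in every pass
    theta = [0.0] * num_weights        # assumed starting point
    for _ in range(num_iters):
        residual = decoder(features, theta)
        theta = [t + d for t, d in zip(theta, residual)]
    return theta

encoder_calls = []

def toy_encoder(image):
    """Hypothetical encoder stand-in: passes the 'image' through unchanged."""
    encoder_calls.append(1)
    return list(image)

def toy_decoder(features, theta):
    """Hypothetical decoder stand-in: moves halfway toward the features."""
    return [0.5 * (f - t) for f, t in zip(features, theta)]

theta = predict_weights([1.0, -2.0], num_iters=4,
                        encoder=toy_encoder, decoder=toy_decoder,
                        num_weights=2)
# theta == [0.9375, -1.875]: each pass halves the remaining gap
```

The toy decoder only shows the control flow. In the real system the decoder is learned, and how quickly the residuals shrink depends on training.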
Experimental Results
Research Questions
- RQ1: Can a hypernetwork predict a compact set of personalized weights that allow high-fidelity subject personalization in diffusion models from a single image?
- RQ2: How does LiDB compare to DreamBooth and Textual Inversion in terms of size, speed, and fidelity?
- RQ3: Does rank-relaxed finetuning enable higher subject fidelity without sacrificing speed?
- RQ4: Is the approach robust across diverse subjects and stylistic prompts?
Key Findings
| Method | Face Rec. | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|
| Ours | 0.655 | 0.473 | 0.577 | 0.286 |
| DreamBooth | 0.618 | 0.441 | 0.546 | 0.282 |
| DreamBooth-Agg-1 | 0.615 | 0.323 | 0.431 | 0.313 |
| DreamBooth-Agg-2 | 0.616 | 0.360 | 0.467 | 0.302 |
| Textual Inversion | 0.623 | 0.289 | 0.472 | 0.277 |
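To read the table quickly, the short script below picks the best-scoring method per metric. The numbers are copied from the table above; it assumes, as is usual for these metrics, that higher is better in every column.

```python
# Scores copied from the table above (higher is assumed to be better).
scores = {
    "Ours":              {"Face Rec.": 0.655, "DINO": 0.473, "CLIP-I": 0.577, "CLIP-T": 0.286},
    "DreamBooth":        {"Face Rec.": 0.618, "DINO": 0.441, "CLIP-I": 0.546, "CLIP-T": 0.282},
    "DreamBooth-Agg-1":  {"Face Rec.": 0.615, "DINO": 0.323, "CLIP-I": 0.431, "CLIP-T": 0.313},
    "DreamBooth-Agg-2":  {"Face Rec.": 0.616, "DINO": 0.360, "CLIP-I": 0.467, "CLIP-T": 0.302},
    "Textual Inversion": {"Face Rec.": 0.623, "DINO": 0.289, "CLIP-I": 0.472, "CLIP-T": 0.277},
}

def best_per_metric(table):
    """Return {metric: method with the highest score on that metric}."""
    metrics = next(iter(table.values())).keys()
    return {metric: max(table, key=lambda method: table[method][metric])
            for metric in metrics}

best = best_per_metric(scores)
# best == {"Face Rec.": "Ours", "DINO": "Ours",
#          "CLIP-I": "Ours", "CLIP-T": "DreamBooth-Agg-1"}
```

HyperDreamBooth leads on the three identity and image-similarity metrics, while the DreamBooth-Agg-1 variant has the highest CLIP-T score.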
- HyperDreamBooth achieves subject personalization in roughly 20 seconds, ~25x faster than DreamBooth and ~125x faster than Textual Inversion.
- The LiDB model is ~10,000x smaller than a standard DreamBooth model (~120 KB, ~30K trainable variables).
- HyperNetwork-guided initialization plus fast finetuning yields strong subject fidelity and consistent style diversity comparable to DreamBooth.
- Rank-relaxed finetuning improves detail capture by temporarily increasing LoRA rank, enabling higher subject fidelity while retaining fast runtime.
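- In the reported experiments, HyperDreamBooth scores higher than both DreamBooth and Textual Inversion on Face Rec., DINO, CLIP-I, and CLIP-T. Only the DreamBooth-Agg-1 and DreamBooth-Agg-2 variants score higher on CLIP-T (0.313 and 0.302 vs. 0.286), and both trail on the other three metrics.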
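Rank-relaxed finetuning, mentioned in the findings above, can be pictured as follows. This is one plausible reading, not a confirmed implementation: the hypernetwork's low-rank prediction is folded into the base weight, and a fresh LoRA pair of higher rank is attached for the fast finetuning step. The function names, the zero initialization of the `up` factor (a common LoRA convention), and the toy sizes are all assumptions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def merge_then_relax(W, delta_hyper, new_rank, rng):
    """Fold the predicted residual into W, then create higher-rank LoRA factors."""
    n, m = len(W), len(W[0])
    W_merged = add(W, delta_hyper)
    down = [[rng.gauss(0.0, 0.01) for _ in range(new_rank)] for _ in range(n)]  # n x r'
    up = [[0.0] * m for _ in range(new_rank)]   # r' x m, zero so training starts at W_merged
    return W_merged, down, up

def effective_weight(W_merged, down, up):
    """Weight actually used by the layer during and after finetuning."""
    return add(W_merged, matmul(down, up))

rng = random.Random(0)
n, m = 6, 5
W = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]

# Rank-1 prediction standing in for the hypernetwork output.
u = [[rng.gauss(0.0, 0.1)] for _ in range(n)]        # n x 1
v = [[rng.gauss(0.0, 0.1) for _ in range(m)]]        # 1 x m
delta_hyper = matmul(u, v)

W_merged, down, up = merge_then_relax(W, delta_hyper, new_rank=4, rng=rng)
W_start = effective_weight(W_merged, down, up)
relaxed_trainable = 4 * (n + m)   # parameters available to the finetuning step
```

Because `up` starts at zero, finetuning begins exactly at the hypernetwork's prediction, and the extra rank only adds capacity for detail that a rank-1 update cannot express.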
![Figure 2 : HyperDreamBooth Training and Fast Fine-Tuning. Phase-1: Training a hypernetwork to predict network weights from a face image, such that a text-to-image diffusion network outputs the person’s face from the sentence "a [v] face" if the predicted weights are applied to it. We use pre-compute](https://ar5iv.labs.arxiv.org/html/2307.06949/assets/x2.png)
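The method list describes training the hypernetwork with a weight-space loss plus a diffusion reconstruction loss. One way to write such a combined objective is given below. The notation is an assumption made for illustration: $\hat{\theta}$ is the predicted LiDB weight vector for image $x$, $\theta$ is a target weight vector for the same subject, $\mathcal{D}_{\hat{\theta}}$ is the diffusion model with the predicted weights applied, $c$ is the conditioning from the prompt "a [V] face", $\epsilon$ is the added noise, and $\alpha$, $\beta$ are relative weights whose values are not given in this summary.

```latex
\mathcal{L}(x) \;=\;
  \alpha \,\bigl\lVert \mathcal{D}_{\hat{\theta}}(x + \epsilon,\, c) - x \bigr\rVert_2^2
  \;+\;
  \beta \,\bigl\lVert \hat{\theta} - \theta \bigr\rVert_2^2
```

The first term asks the personalized model to reconstruct the face from a noised input. The second pulls the prediction toward weights already known to work for that subject.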
This walkthrough was generated by AI and reviewed by a human editor.