QUICK REVIEW

[Paper Review] DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Hong Chen, Yipeng Zhang|arXiv (Cornell University)|May 5, 2023

Video Analysis and Summarization | Cited by 11

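One-Sentence Summary

DisenBooth introduces identity-preserving disentangled tuning for subject-driven text-to-image (T2I) generation: it uses a separate textual identity-preserved embedding and a visual identity-irrelevant embedding, together with auxiliary objectives, to achieve better identity preservation and prompt fidelity.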

ABSTRACT

Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.

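Motivation and Objectives

Improve subject-driven text-to-image generation by resolving the entanglement between subject identity and background/pose.
Propose a disentangled tuning framework that separately preserves the subject identity and captures identity-irrelevant information.
Develop auxiliary objectives that enforce disentanglement while finetuning the diffusion model.
Use adapters and LoRA for parameter-efficient finetuning.
Demonstrate improved generation quality and controllability over baseline methods.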

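Proposed Method

Finetune the diffusion model in the denoising process.
Extract the identity-preserved text embedding f_s from a special prompt P_s with the CLIP text encoder.
Extract an identity-irrelevant visual embedding f_i for each image with a CLIP image encoder equipped with an adapter.
Optimize the joint loss L = L1 + L2 + L3, which combines precise denoising conditioned on f_s + f_i, weak denoising conditioned on f_s alone, and a contrastive embedding objective that encourages disentanglement.
Apply LoRA-based parameter-efficient finetuning to the U-Net and the adapter, reducing the number of trainable parameters.
At generation time, combine f_s with the text prompt to produce subject-driven outputs, optionally mixing in η f_i to carry over characteristics of the reference image.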
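
The joint objective and the generation-time mixing listed in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the noise predictions are passed in as plain lists instead of being produced by a U-Net, every function name is hypothetical, the weights lam_weak and lam_contrast are placeholder values that do not come from this summary, and the contrastive term is assumed here to be the cosine similarity between f_s and f_i.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def joint_loss(noise, pred_full, pred_id_only, f_s, f_i,
               lam_weak=0.01, lam_contrast=0.001):
    """L = L1 + L2 + L3 (weights are assumed placeholders).

    noise        : the true noise added to the latent
    pred_full    : noise predicted when conditioning on f_s + f_i
    pred_id_only : noise predicted when conditioning on f_s alone
    """
    l1 = mse(noise, pred_full)                 # denoising with f_s + f_i
    l2 = lam_weak * mse(noise, pred_id_only)   # weak denoising with f_s only
    l3 = lam_contrast * cosine(f_s, f_i)       # assumed form: push f_s and f_i apart
    return l1 + l2 + l3


def mix_condition(f_text, f_i, eta=0.0):
    """Generation-time condition: text embedding plus eta * f_i.

    eta = 0 ignores the reference image's identity-irrelevant features.
    """
    return [t + eta * v for t, v in zip(f_text, f_i)]
```

With eta left at 0 the condition is purely text-driven; raising eta carries over more of the reference image's background or pose, which matches the optional mixing described in the last item of the list.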

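Experimental Results

Research Questions

RQ1: Can subject identity be preserved in diffusion-based T2I generation while still allowing flexible text-driven customization?
RQ2: Does disentangling the identity-preserved textual information from the identity-irrelevant visual information improve prompt fidelity and identity preservation?
RQ3: Can parameter-efficient finetuning (LoRA/adapters) achieve competitive results with disentangled embeddings?
RQ4: How do the proposed weak denoising and contrastive embedding objectives affect disentanglement and generation quality?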

Key Findings

DINO score	CLIP-T score	Average user rank
0.675	0.330	1.589
0.362	0.352	-
0.605	0.303	2.893
0.546	0.318	3.072
0.685	0.319	2.445
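
Both automatic metrics in the table are commonly computed as average cosine similarities between embeddings: DINO between generated and reference images of the subject, CLIP-T between the text prompt and the generated images. The sketch below shows only that averaging step. It assumes the embeddings have already been extracted by the respective encoders, the function names are hypothetical, and the toy vectors in the usage note are not real model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(gen_embs, ref_embs):
    """Average cosine similarity over every (generated, reference) pair.

    For a DINO-style score, both lists hold image embeddings.
    For a CLIP-T-style score, ref_embs holds the prompt embedding(s).
    """
    sims = [cosine(g, r) for g in gen_embs for r in ref_embs]
    return sum(sims) / len(sims)
```

For example, mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]) averages one perfect match and one orthogonal pair.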

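Compared with the baselines, DisenBooth preserves subject identity well (DINO) while also achieving high text-prompt fidelity (CLIP-T).
DisenBooth outperforms TI, DreamBooth, and InstructPix2Pix in the subjective user ranking.
Ablation studies show that f_s captures identity while f_i captures background/pose characteristics, enabling flexible inheritance of reference features.
Combining f_s with η f_i enables controllable transfer of reference characteristics without overfitting to the background.
Finetuning requires about 2.9M parameters (LoRA + adapter), making it efficient compared with finetuning the full U-Net.
Experiments on DreamBench show that DisenBooth achieves the best subject-driven generation performance among all evaluated methods.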
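
To give a feel for why LoRA keeps the trainable parameter count small, the sketch below compares a full weight matrix with its low-rank update. LoRA replaces the update to a d_out x d_in weight with the product of a d_out x r and an r x d_in matrix, so only r * (d_in + d_out) values are trained. The layer size and rank in the usage note are illustrative assumptions; they are not taken from the paper and are not meant to reproduce the 2.9M figure.

```python
def full_params(d_in, d_out):
    """Trainable weights when finetuning a dense d_out x d_in layer directly."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for a rank-`rank` LoRA update on the same layer.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * d_in + d_out * rank
```

For a hypothetical 320 x 320 projection with rank 4, lora_params(320, 320, 4) trains 2560 values where full_params(320, 320) would train 102400, a 40-fold reduction for that single layer.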

This review was generated by AI and checked by a human editor.