QUICK REVIEW

[Paper Review] Cones 2: Customizable Image Synthesis with Multiple Subjects

Zhiheng Liu, Yifei Zhang|arXiv (Cornell University)|May 30, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 8

One-Sentence Summary

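Cones 2 introduces per-subject residual token embeddings and layout-guided diffusion, which let multiple user-specified subjects be combined flexibly in one generated image without retraining the model, with strong performance and scalability.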

ABSTRACT

Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

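Motivation and Objectives

Advance customizable multi-subject image synthesis for real-world applications.
Represent each subject efficiently with a residual token embedding on top of the base text embedding.
Introduce layout-based spatial guidance to control subject placement and reduce interference between subjects.
Develop a training objective that learns a subject-specific residual while preserving the original text embeddings.
Demonstrate scaling to six subjects, with performance competitive with or better than existing state-of-the-art baselines.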

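Proposed Method

Represent each subject with a residual token embedding, Delta_custom, that shifts the base embedding toward the customized subject.
Train a subject-specific text encoder with a subject-preservation loss and a text-embedding-preservation regularizer, which localizes the shift to the subject token.
For each subject, compute Delta_custom as the mean of the embedding differences over a large number of captions containing that subject.
At inference, compose subjects by adding each Delta_custom vector to the corresponding subject token in the input embedding (no model retraining required).
Use a layout prior to guide placement by rectifying cross-attention activations: strengthen the target subject's region and weaken irrelevant regions.
During sampling, edit the cross-attention maps with layout-based masks to control subject positions across timesteps.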
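The residual computation and composition steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `compute_residual` and `compose`, the toy shapes, and the random data are all hypothetical, and the sketch assumes the "embedding difference" is taken at the subject token between a fine-tuned and a base text encoder.

```python
import numpy as np

def compute_residual(base_embs, custom_embs, subject_pos):
    """Mean embedding shift at the subject token across captions.

    base_embs, custom_embs: (num_captions, seq_len, dim) text embeddings
        from the base and the subject-specific encoder (assumed inputs).
    subject_pos: (num_captions,) index of the subject token per caption.
    """
    rows = np.arange(len(subject_pos))
    diffs = custom_embs[rows, subject_pos] - base_embs[rows, subject_pos]
    return diffs.mean(axis=0)

def compose(prompt_emb, residuals, token_pos):
    """Add each subject's residual at its token; other tokens are untouched."""
    out = prompt_emb.copy()
    for name, pos in token_pos.items():
        out[pos] = out[pos] + residuals[name]
    return out

# Toy data: the "customized" encoder shifts only the subject token.
rng = np.random.default_rng(0)
n_caps, seq_len, dim = 5, 6, 8
base = rng.normal(size=(n_caps, seq_len, dim))
true_shift = rng.normal(size=dim)
pos = np.array([1, 2, 3, 1, 4])
custom = base.copy()
custom[np.arange(n_caps), pos] += true_shift

delta_dog = compute_residual(base, custom, pos)
delta_mug = rng.normal(size=dim)  # stand-in for a second learned residual

# Compose two subjects in one prompt embedding without touching the model.
prompt = rng.normal(size=(seq_len, dim))
composed = compose(prompt, {"dog": delta_dog, "mug": delta_mug},
                   {"dog": 1, "mug": 4})
```

Because each subject is reduced to one vector, adding a subject at inference is a single vector addition at its token position, which is why no retraining is needed for new combinations.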

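Experimental Results

Research Questions

RQ1: How can multiple user-specified subjects be represented and composed efficiently without retraining the diffusion model?
RQ2: Can a simple residual embedding on top of the base text embedding support reliable multi-subject customization and composition?
RQ3: Does a layout prior that guides cross-attention improve subject placement and reduce attribute interference?
RQ4: How does the method scale to more subjects, and how does it handle semantically similar subjects?
RQ5: How does it compare with state-of-the-art baselines in text alignment, image similarity, and efficiency?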

Key Findings

Subjects	Method	Text Alignment	Image Alignment	Storage	Complexity
1	DreamBooth	0.314	0.727	3.3 GB	O(n)
1	Custom Diffusion	0.327	0.721	72 MB	O(n)
1	Cones	0.331	0.722	(1.43 ± 0.34) MB	O(n)
1	Ours	0.330	0.725	4.8 KB	O(n)
2	DreamBooth	0.278	0.664	3.3 GB	O(n^2)
2	Custom Diffusion	0.284	0.676	72 MB	O(n^2)
2	Cones	0.292	0.685	(3.41 ± 0.56) MB	O(n^2)
2	Ours	0.309	0.708	9.6 KB	O(n)
3	DreamBooth	0.252	0.649	3.3 GB	O(n^3)
3	Custom Diffusion	0.270	0.658	72 MB	O(n^3)
3	Cones	0.281	0.663	(4.96 ± 0.70) MB	O(n^3)
3	Ours	0.304	0.689	14.4 KB	O(n)
4	DreamBooth	0.241	0.604	3.3 GB	O(n^4)
4	Custom Diffusion	0.254	0.623	72 MB	O(n^4)
4	Cones	0.271	0.638	(7.75 ± 0.56) MB	O(n^4)
4	Ours	0.299	0.673	19.2 KB	O(n)
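The Storage values for "Ours" in the table above grow by a fixed amount per subject. The short check below takes the single-subject figure as the per-subject cost and compares it with the reported values. The helper name `predicted_storage_kb` is hypothetical, and the numbers are copied from the table rather than measured.

```python
import math

# Storage reported for "Ours" in the table, in KB, keyed by subject count.
reported_kb = {1: 4.8, 2: 9.6, 3: 14.4, 4: 19.2}
per_subject_kb = reported_kb[1]

def predicted_storage_kb(num_subjects, per_subject=per_subject_kb):
    """Linear model: one fixed-size residual stored per subject."""
    return per_subject * num_subjects

# Compare with a tolerance, since 4.8 is not exactly representable in binary.
matches = {n: math.isclose(predicted_storage_kb(n), kb)
           for n, kb in reported_kb.items()}
all_linear = all(matches.values())
```

All four reported values match the linear model, which is consistent with storing one small residual vector per subject instead of a copy of fine-tuned model weights.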

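The residual token embedding approach enables flexible composition of multiple subjects without retraining the diffusion model.
The text-embedding-preservation loss localizes customization to the subject token, making multi-subject composition more robust.
Layout-based cross-attention rectification improves subject placement and reduces interference between subjects.
The method outperforms DreamBooth, Custom Diffusion, and Cones in multi-subject settings, including challenging cases with up to six subjects.
With one to four subjects, the proposed method is competitive in text and image alignment: it leads on both metrics for two to four subjects and is within 0.002 of the best single-subject scores, while needing far less storage (4.8 KB per subject) and lower training complexity.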
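The layout-based cross-attention rectification described above can be sketched as a bias on the pre-softmax attention scores. This is a simplified illustration under stated assumptions, not the paper's exact formulation: `rectify_cross_attention` is a hypothetical name, the layout is given as one boolean mask per subject token, and the edit intensity is a single scalar `strength`, whereas an actual sampler would likely vary it across timesteps.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rectify_cross_attention(logits, layout_masks, strength):
    """Bias subject-token scores toward their layout regions.

    logits: (num_pixels, num_tokens) pre-softmax cross-attention scores.
    layout_masks: {token_index: bool array (num_pixels,)} region per subject.
    strength: scalar edit intensity (assumption: fixed, not scheduled).
    """
    edited = logits.copy()
    for tok, mask in layout_masks.items():
        edited[mask, tok] += strength   # strengthen inside the target region
        edited[~mask, tok] -= strength  # weaken everywhere else
    return softmax(edited, axis=-1)

# Toy 4x4 latent grid (16 pixels, row-major) and a 5-token prompt.
num_pixels, num_tokens = 16, 5
logits = np.zeros((num_pixels, num_tokens))
left = (np.arange(num_pixels) % 4) < 2   # left half of the grid
masks = {1: left, 3: ~left}              # token 1 on the left, token 3 on the right

attn = rectify_cross_attention(logits, masks, strength=2.0)
```

In this toy case token 1 attends more strongly to the left half and token 3 to the right half, so the two subjects are pushed into separate regions instead of competing for the same pixels.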

This review was generated by AI and reviewed by a human editor.