QUICK REVIEW

[Paper Review] Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Weixi Feng, Xuehai He|arXiv (Cornell University)|Dec 9, 2022

Generative Adversarial Networks and Image Synthesis|Cited by 70

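One-Sentence Summary

A training-free method that injects structured language guidance into the cross-attention layers of Stable Diffusion to improve attribute binding and compositionality in T2I, evaluated on the new ABC-6K and CC-500 benchmarks.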

ABSTRACT

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures with the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient that requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.

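Motivation and Goals

Improve the binding between objects and their attributes in T2I outputs.
Improve compositional generation in multi-object scenes without using additional training data.
Use structured language representations to guide the cross-attention of diffusion models.
Introduce a benchmark (ABC-6K) to quantify compositionality and binding accuracy.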

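Proposed Method

Extract multiple noun phrases from the prompt using a constituency parser or a scene graph.
Encode each text span with the frozen CLIP text encoder and re-align its embeddings with the full prompt sequence.
Modify cross-attention by using attention maps to map the semantics of each text span onto the image regions it attends to.
Compute attention-based value vectors from all structured text spans and fuse them into the diffusion guidance (Equations 1–4).
Introduce a variant that aggregates multiple attention maps over the concatenated prompts (Equations 5–6).
Demonstrate training-free integration with Stable Diffusion, requiring no additional data.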
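
The steps above can be sketched as a toy computation. This is a minimal illustration, not the authors' implementation: it assumes the fused output is an unweighted average of attention-weighted values over the full prompt and each re-aligned span, it treats token features directly as keys and values (real models apply learned projections first), and all numbers are made up. The function names (`attention_map`, `realign`, `fuse_values`, `fuse_with_own_maps`) are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(queries, keys):
    """softmax(Q K^T / sqrt(d)): one row per image location, one column per token."""
    d = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]
    rows = []
    for row in matmul(queries, keys_t):
        scaled = [s / math.sqrt(d) for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def realign(full_seq, span_seq, start):
    """Copy the full-prompt sequence and overwrite the span's token positions."""
    out = [list(tok) for tok in full_seq]
    for offset, tok in enumerate(span_seq):
        out[start + offset] = list(tok)
    return out


def average(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(m[r][c] for m in mats) / n for c in range(len(mats[0][0]))]
            for r in range(len(mats[0]))]


def fuse_values(attn, value_sets):
    """Assumed fusion: one shared attention map, values averaged over all spans."""
    return average([matmul(attn, v) for v in value_sets])


def fuse_with_own_maps(queries, key_sets, value_sets):
    """Assumed variant: each sequence gets its own attention map before averaging."""
    return average([matmul(attention_map(queries, k), v)
                    for k, v in zip(key_sets, value_sets)])


# Toy setup: 2 image locations, 4 prompt tokens, feature dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
full = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]  # full-prompt token features
span_a = [[1.0, 0.0]]  # a noun phrase encoded on its own, covering token 1
span_b = [[0.0, 1.0]]  # another noun phrase, covering token 2

value_sets = [full, realign(full, span_a, 1), realign(full, span_b, 2)]
attn = attention_map(queries, full)  # layout comes from the full prompt
fused = fuse_values(attn, value_sets)
fused_variant = fuse_with_own_maps(queries, value_sets, value_sets)
```

With only the full prompt in `value_sets`, `fuse_values` reduces to ordinary cross-attention, which is one way to see why such a scheme needs no retraining: the attention layers and their weights are unchanged, and only the text-side inputs differ.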

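Experimental Results

Research Questions

RQ1: Does structured cross-attention guidance improve attribute-object binding in T2I generation?
RQ2: How do the structured representations (constituency tree vs. scene graph) affect compositionality and image fidelity?
RQ3: Does the method generalize to general prompts while preserving image quality?
RQ4: What causes incorrect compositions, and how do attention maps relate to layout and content?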

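Key Findings

In head-to-head user comparison studies against the Stable Diffusion baseline, StructureDiffusion achieves a 5-8% advantage.
The method improves compositionality at both the object level and the scene level, including color correctness and fewer missing objects.
The method keeps overall image fidelity and diversity on par with the baseline (IS/FID/R-Prec).
Both scene graph input and constituency parsing support structured guidance, with qualitative improvements in color binding and object completeness.
A new benchmark, ABC-6K, is proposed for evaluating attribute binding in compositional prompts, alongside CC-500 and general MSCOCO prompts.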

This review was generated by AI and reviewed by a human editor.