QUICK REVIEW

[Paper Review] Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng, Qianqian Xu|arXiv (Cornell University)|Mar 6, 2026

Generative Adversarial Networks and Image Synthesis|Cited by 0

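One-Sentence Summary

This paper identifies two gaps between cross-attention maps and semantic correlation in training-free diffusion segmentors and proposes auto aggregation (head-wise and layer-wise) plus per-pixel rescaling, together called GoCA, so that segmentation scales better with stronger diffusion models; it reports sizable gains on standard benchmarks and improved integration with a generative technique.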

ABSTRACT

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques, auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.

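Motivation and Objectives

Identify why training-free diffusion segmentors fail to scale with stronger diffusion models.
Propose auto aggregation and per-pixel rescaling to bridge the gaps between cross-attention maps and semantic correlation.
Demonstrate improved segmentation performance on multiple benchmarks when stronger diffusion models are used.
Demonstrate integration with a generative technique to validate broader applicability.
Highlight ablation and qualitative results that support the method's effectiveness.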

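Proposed Method

Decompose multi-head, multi-layer cross-attention into head-wise and layer-wise contributions to form the auto aggregation weights.
Apply head-level and layer-level aggregation to produce a unified global attention map from the per-head and per-layer maps.
Introduce pseudo self-attention computed from dense diffusion features, which gives a self-estimated layer contribution for layer-weight estimation.
Apply per-pixel rescaling: exclude semantically special tokens, normalize the attention scores across the content-word tokens at each pixel, then normalize per token.
Multiply the refined attention map by the self-attention map as post-processing to obtain the segmentation.
Optionally integrate GoCA with generative techniques such as S-CFG to improve generation quality.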
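The aggregation, rescaling, and post-processing steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, all layers are assumed to be already resized to a common resolution, the head and layer weights are passed in as placeholders (the summary does not say how they are estimated), per-token normalization is assumed to be min-max scaling, and "multiply" is assumed to mean a matrix product with a row-normalized self-attention map.

```python
import numpy as np

def auto_aggregate(attn, head_w, layer_w):
    """Collapse (layers, heads, pixels, tokens) attention to (pixels, tokens).

    head_w (layers, heads) and layer_w (layers,) stand in for the
    automatically estimated weights; their estimation is not shown here.
    """
    head_w = head_w / head_w.sum(axis=1, keepdims=True)   # sum to 1 within each layer
    layer_w = layer_w / layer_w.sum()                      # sum to 1 across layers
    per_layer = np.einsum("lhpt,lh->lpt", attn, head_w)    # head-wise aggregation
    return np.einsum("lpt,l->pt", per_layer, layer_w)      # layer-wise aggregation

def per_pixel_rescale(global_map, content_idx, eps=1e-8):
    """Keep content-word tokens, normalize per pixel, then per token."""
    m = global_map[:, content_idx]                          # drop the excluded tokens
    m = m / (m.sum(axis=1, keepdims=True) + eps)            # per-pixel normalization
    lo = m.min(axis=0, keepdims=True)
    hi = m.max(axis=0, keepdims=True)
    return (m - lo) / (hi - lo + eps)                       # per-token min-max (assumed)

def refine_and_segment(token_map, self_attn):
    """Propagate scores through a (pixels, pixels) map, then pick a token per pixel."""
    refined = self_attn @ token_map                         # matrix product (assumed)
    return refined.argmax(axis=1)

# Toy input: 2 layers, 3 heads, 16 pixels, 6 text tokens (random placeholders).
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 16, 6))
global_map = auto_aggregate(attn, np.ones((2, 3)), np.ones(2))
scores = per_pixel_rescale(global_map, content_idx=[1, 2, 3])
self_attn = rng.random((16, 16))
self_attn /= self_attn.sum(axis=1, keepdims=True)           # row-normalize
labels = refine_and_segment(scores, self_attn)
print(labels.shape)
```

With uniform weights, as in the toy call, the aggregation reduces to a plain mean over heads and layers; the point of learned or estimated weights is to depart from that average.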

Figure 1 : (a) Previous training-free diffusion segmentors scale poorly with the generative power of diffusion models, which inspires our study to enable such scaling. (b) We have identified two gaps from individual cross-attention maps to semantic correlation, which have been preventing the aforeme

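Experimental Results

Research Questions

RQ1: Why do existing training-free diffusion segmentors fail to scale when stronger diffusion models are used?
RQ2: How can the aggregated cross-attention map be made to better reflect global semantic correlation, for more reliable segmentation?
RQ3: Do auto aggregation and per-pixel rescaling let stronger diffusion models achieve better segmentation results?
RQ4: Does GoCA improve segmentation on standard benchmarks and strengthen integration with generative techniques?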

Key Findings

Type	Method	VOC	Context	COCO-Object	Cityscapes	ADE20K
Non-DM	MaskCLIP	38.8	23.6	20.6	10.0	9.8
Non-DM	ReCO	25.1	19.9	15.7	19.3	11.2
Pre-Trained DM	DiffSegmentor	60.1	27.5	37.9	-	-
Pre-Trained DM	MaskDiffusion	29.9	-	-	17.1	-
Pre-Trained DM	FTTM 1	48.9	30.0	34.6	12.3	20.3
Vanilla	SD v1.5	44.3	32.3	32.3	11.8	18.0
Vanilla	SD XL	51.1	35.7	37.2	16.1	18.6
Vanilla	Pixart-Sigma	45.2	37.0	33.4	22.5	19.1
Vanilla	Flux	55.7	48.4	43.3	25.6	24.5
Baseline	SD v1.5	51.1	35.4	36.9	18.4	21.0
Ours	SD v1.5	60.7	40.4	39.2	16.1	22.0
Ours	SD XL	65.6	42.3	44.3	21.2	23.2
Ours	Pixart-Sigma	63.6	43.2	39.8	22.6	23.8
Ours	Flux	70.7	51.1	48.1	27.1	29.3

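Stronger diffusion models (SD XL, PixArt-Sigma, Flux) benefit from GoCA's aggregation scheme: with GoCA, each of them outperforms SD v1.5 on every benchmark.
GoCA (auto aggregation + per-pixel rescaling) outperforms the Vanilla setting on VOC, Context, COCO-Object, Cityscapes, and ADE20K for every backbone. Against the SD v1.5 Baseline it is better on four of the five benchmarks; the exception is Cityscapes (16.1 vs. 18.4).
Layer-wise auto aggregation performs on par with manually tuned layer weights, and the full GoCA achieves the best performance.
Ablations show that both parts of auto aggregation (head-wise and layer-wise) and per-pixel rescaling each contribute, with the combined GoCA giving the largest gain.
The improved segmentation from GoCA also benefits generative techniques such as S-CFG, yielding better FID and CLIP scores at the evaluated CFG strengths.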
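As a quick arithmetic check on the comparisons above, the per-benchmark differences can be recomputed directly from the table. The numbers below are copied from the Vanilla, Baseline, and Ours rows; the helper names are illustrative only.

```python
BENCHMARKS = ["VOC", "Context", "COCO-Object", "Cityscapes", "ADE20K"]

# Scores copied from the results table (Vanilla vs. Ours, same backbone).
VANILLA = {
    "SD v1.5": [44.3, 32.3, 32.3, 11.8, 18.0],
    "SD XL": [51.1, 35.7, 37.2, 16.1, 18.6],
    "Pixart-Sigma": [45.2, 37.0, 33.4, 22.5, 19.1],
    "Flux": [55.7, 48.4, 43.3, 25.6, 24.5],
}
OURS = {
    "SD v1.5": [60.7, 40.4, 39.2, 16.1, 22.0],
    "SD XL": [65.6, 42.3, 44.3, 21.2, 23.2],
    "Pixart-Sigma": [63.6, 43.2, 39.8, 22.6, 23.8],
    "Flux": [70.7, 51.1, 48.1, 27.1, 29.3],
}
BASELINE_SD15 = [51.1, 35.4, 36.9, 18.4, 21.0]

def gains(after, before):
    """Per-benchmark difference in points, rounded to the table's precision."""
    return [round(a - b, 1) for a, b in zip(after, before)]

def mean_gain(backbone):
    """Average gain of Ours over Vanilla across the five benchmarks."""
    g = gains(OURS[backbone], VANILLA[backbone])
    return round(sum(g) / len(g), 2)

for name in OURS:
    print(name, dict(zip(BENCHMARKS, gains(OURS[name], VANILLA[name]))), mean_gain(name))
print("Ours vs. Baseline (SD v1.5):",
      dict(zip(BENCHMARKS, gains(OURS["SD v1.5"], BASELINE_SD15))))
```

Every Ours-minus-Vanilla difference is positive, while the comparison against the SD v1.5 Baseline has one negative entry, on Cityscapes.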

Figure 2 : Attention maps in different heads and layers show a certain collaboration pattern, each focusing on distinct aspects of the image.

This review was generated by AI and checked by a human editor.