QUICK REVIEW

[Paper Review] Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng, Qianqian Xu|arXiv (Cornell University)|Mar 6, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

Summary in Brief

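This paper identifies two gaps between the cross-attention maps used by training-free diffusion segmentors and true semantic correlation, and introduces auto aggregation (head-wise and layer-wise) and per-pixel rescaling, together referred to as GoCA. These techniques let segmentation quality scale with more powerful diffusion models, yield significant gains on standard benchmarks, and improve integration with a generative technique.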

ABSTRACT

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.

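Motivation and Objectives

Identify why training-free diffusion segmentors fail to scale with more powerful diffusion models.
Propose auto aggregation and per-pixel rescaling to bridge the gap between cross-attention maps and semantic correlation.
Improve segmentation performance across benchmarks when using more powerful diffusion models.
Demonstrate integration with a generative technique to verify broad applicability.
Support the method's effectiveness with ablations and qualitative results.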

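Proposed Method

Decompose head-wise and layer-wise contributions to form the auto-aggregation weights.
Use head-wise and layer-wise aggregation to produce a unified global attention map from the individual head and layer maps.
Introduce self-attention-style layer weights that estimate each layer's contribution from dense diffusion features.
Apply per-pixel rescaling: exclude special tokens, normalize each pixel's attention scores across the content-word tokens, then normalize per token.
Multiply the refined attention map by the self-attention map as post-processing to obtain the segmentation.
Optionally integrate GoCA with generative techniques such as S-CFG to improve generation quality.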
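
The steps above can be pictured with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the head and layer weights are taken as given inputs (the paper estimates them automatically), the per-token step is assumed to be min-max normalization, and "multiplying" by the self-attention map is assumed to mean a matrix product.

```python
import numpy as np

def aggregate_maps(maps, head_weights, layer_weights):
    """Collapse per-head, per-layer cross-attention into one global map.

    maps: (layers, heads, pixels, tokens). The weights are hypothetical
    inputs here; the paper derives them automatically.
    """
    hw = head_weights / head_weights.sum(axis=1, keepdims=True)  # (layers, heads)
    lw = layer_weights / layer_weights.sum()                     # (layers,)
    per_layer = np.einsum("lhpt,lh->lpt", maps, hw)              # head-wise step
    return np.einsum("lpt,l->pt", per_layer, lw)                 # layer-wise step

def per_pixel_rescale(global_map, content_idx, eps=1e-8):
    """Keep content tokens, normalize per pixel, then per token."""
    scores = global_map[:, content_idx]                          # drop special tokens
    scores = scores / (scores.sum(axis=1, keepdims=True) + eps)  # per-pixel
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    return (scores - lo) / (hi - lo + eps)                       # per-token (assumed min-max)

def refine_and_segment(scores, self_attn):
    """Propagate scores through self-attention (assumed matrix product),
    then assign each pixel to its highest-scoring content token."""
    refined = self_attn @ scores                                 # (pixels, content tokens)
    return refined.argmax(axis=1)

# Toy usage with random data: 3 layers, 4 heads, 16 pixels, 6 tokens.
rng = np.random.default_rng(0)
L, H, P, T = 3, 4, 16, 6
maps = rng.random((L, H, P, T))
head_w = rng.random((L, H))
layer_w = rng.random(L)
self_attn = rng.random((P, P))
self_attn /= self_attn.sum(axis=1, keepdims=True)
content_idx = [1, 3, 4]  # hypothetical positions of content-word tokens

global_map = aggregate_maps(maps, head_w, layer_w)
scores = per_pixel_rescale(global_map, content_idx)
labels = refine_and_segment(scores, self_attn)
```

With uniform weights, `aggregate_maps` reduces to a plain average over heads and layers, which is one way to picture what a non-adaptive aggregation would compute.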

Figure 1 : (a) Previous training-free diffusion segmentors scale poorly with the generative power of diffusion models, which inspires our study to enable such scaling. (b) We have identified two gaps from individual cross-attention maps to semantic correlation, which have been preventing the aforeme

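Experimental Results

Research Questions

RQ1: Why do existing training-free diffusion segmentors fail to scale when more powerful diffusion models are used?
RQ2: How can aggregated cross-attention maps be made to better reflect global semantic correlation for reliable segmentation?
RQ3: Do auto aggregation and per-pixel rescaling allow more powerful diffusion models to achieve better segmentation results?
RQ4: Does GoCA improve segmentation on standard benchmarks and strengthen integration with generative techniques?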

Key Findings

Type	Method	VOC	Context	COCO-Object	Cityscapes	ADE20K
Non-DM	MaskCLIP	38.8	23.6	20.6	10.0	9.8
Non-DM	ReCO	25.1	19.9	15.7	19.3	11.2
Pre-Trained DM	DiffSegmentor	60.1	27.5	37.9	-	-
Pre-Trained DM	MaskDiffusion	29.9	-	-	17.1	-
Pre-Trained DM	FTTM 1	48.9	30.0	34.6	12.3	20.3
Vanilla	SD v1.5	44.3	32.3	32.3	11.8	18.0
Vanilla	SD XL	51.1	35.7	37.2	16.1	18.6
Vanilla	Pixart-Sigma	45.2	37.0	33.4	22.5	19.1
Vanilla	Flux	55.7	48.4	43.3	25.6	24.5
Baseline	SD v1.5	51.1	35.4	36.9	18.4	21.0
Ours	SD v1.5	60.7	40.4	39.2	16.1	22.0
Ours	SD XL	65.6	42.3	44.3	21.2	23.2
Ours	Pixart-Sigma	63.6	43.2	39.8	22.6	23.8
Ours	Flux	70.7	51.1	48.1	27.1	29.3

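With GoCA-based aggregation, the more powerful diffusion models (SD XL, PixArt-Sigma, Flux) outperform SD v1.5 on every benchmark in the table.
GoCA (auto aggregation plus per-pixel rescaling) outperforms the Vanilla results for the same backbone on VOC, Context, COCO-Object, Cityscapes, and ADE20K. Against the SD v1.5 Baseline it is better on every benchmark except Cityscapes (16.1 vs. 18.4).
Layer-wise auto aggregation achieves results comparable to manually tuned layer weights, and the full GoCA performs best.
Ablations show that head-wise auto aggregation, layer-wise auto aggregation, and per-pixel rescaling each contribute, with the full GoCA combination giving the largest improvement.
Segmentation from GoCA improves the quality of generative techniques (e.g., S-CFG), yielding better FID and CLIP scores across CFG strengths.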
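
The comparison against the Vanilla and Baseline rows can be checked directly from the table values. The short script below is only a convenience for reading the table; the variable and function names are made up for this example, and the numbers are copied from the rows above.

```python
BENCHMARKS = ["VOC", "Context", "COCO-Object", "Cityscapes", "ADE20K"]

# Scores copied from the table (rows "Vanilla", "Baseline", "Ours").
vanilla = {
    "SD v1.5": [44.3, 32.3, 32.3, 11.8, 18.0],
    "SD XL": [51.1, 35.7, 37.2, 16.1, 18.6],
    "Pixart-Sigma": [45.2, 37.0, 33.4, 22.5, 19.1],
    "Flux": [55.7, 48.4, 43.3, 25.6, 24.5],
}
ours = {
    "SD v1.5": [60.7, 40.4, 39.2, 16.1, 22.0],
    "SD XL": [65.6, 42.3, 44.3, 21.2, 23.2],
    "Pixart-Sigma": [63.6, 43.2, 39.8, 22.6, 23.8],
    "Flux": [70.7, 51.1, 48.1, 27.1, 29.3],
}
baseline_sd15 = [51.1, 35.4, 36.9, 18.4, 21.0]

def gains(new, old):
    """Per-benchmark difference, rounded to one decimal like the table."""
    return [round(n - o, 1) for n, o in zip(new, old)]

gain_over_vanilla = {model: gains(ours[model], vanilla[model]) for model in ours}
gain_over_baseline = gains(ours["SD v1.5"], baseline_sd15)

for model, deltas in gain_over_vanilla.items():
    print(model, dict(zip(BENCHMARKS, deltas)))
print("SD v1.5 vs Baseline", dict(zip(BENCHMARKS, gain_over_baseline)))
```

Every gain over Vanilla is positive. The gains over the SD v1.5 Baseline are 9.6, 5.0, 2.3, -2.3, and 1.0, so Cityscapes is the one benchmark where the Baseline stays ahead.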

Figure 2 : Attention maps in different heads and layers show a certain collaboration pattern, each focusing on distinct aspects of the image.


This review was generated by AI and checked by a human editor.