QUICK REVIEW

[Paper Review] Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng, Qianqian Xu | arXiv (Cornell University) | 2026. 03. 06.

Generative Adversarial Networks and Image Synthesis | Citations: 0

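One-Line Summary

This paper identifies two gaps between cross-attention maps and semantic correlation in training-free diffusion segmentors. To let these segmentors scale better with stronger diffusion models, it introduces auto aggregation (head-wise and layer-wise) and per-pixel rescaling, together called GoCA, achieving substantial gains on standard benchmarks and improving integration with generative techniques.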

ABSTRACT

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.

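Motivation and Objectives

Identify why training-free diffusion segmentors fail to scale with stronger diffusion models.
Bridge the gap between cross-attention maps and semantic correlation with auto aggregation, complemented by per-pixel rescaling.
Demonstrate empirically that the method improves segmentation performance across benchmarks when stronger diffusion models are used.
Show integration with a generative technique to validate broader applicability.
Support the method's effectiveness with ablations and qualitative results.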

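Proposed Method

Decompose multi-head, multi-layer cross-attention into head-wise and layer-wise contributions to form the auto-aggregation weights.
Use head-wise and layer-wise aggregation to produce a unified global attention map from the per-head and per-layer maps.
Introduce probabilistic, self-attention-based layer weights that estimate each layer's contribution using dense diffusion features.
Apply per-pixel rescaling: normalize the attention scores at each pixel across content-word tokens only (other tokens are excluded), then normalize each token's map.
Post-process the segmentation by multiplying the refined attention map with the self-attention map.
Optionally integrate GoCA with generative techniques such as S-CFG to improve generation quality.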
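
The aggregation and rescaling steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the aggregation weights are simply given rather than derived automatically, the per-token normalization is assumed to be a division by each token's peak score, the self-attention refinement is omitted, and the function names `aggregate`, `rescale`, and `segment` are hypothetical. Plain Python lists are used to keep the example dependency-free; a real pipeline would operate on tensors.

```python
def aggregate(maps, weights):
    """Weighted average of per-head/per-layer maps into one global map.

    maps: list of maps; each map is a list of pixels, each pixel a list of
          per-token attention scores.
    weights: one non-negative weight per map (assumed given here).
    """
    total = sum(weights)
    n_pixels, n_tokens = len(maps[0]), len(maps[0][0])
    out = [[0.0] * n_tokens for _ in range(n_pixels)]
    for m, w in zip(maps, weights):
        for p in range(n_pixels):
            for t in range(n_tokens):
                out[p][t] += (w / total) * m[p][t]
    return out


def rescale(global_map, eps=1e-8):
    """Per-pixel normalization across tokens, then per-token normalization."""
    # Step 1: at each pixel, make the content-token scores sum to (about) 1.
    per_pixel = [[s / (sum(row) + eps) for s in row] for row in global_map]
    # Step 2 (assumed form): divide each token's map by its peak over pixels,
    # so a token with globally low scores can still win where it is strongest.
    n_tokens = len(per_pixel[0])
    peaks = [max(row[t] for row in per_pixel) + eps for t in range(n_tokens)]
    return [[row[t] / peaks[t] for t in range(n_tokens)] for row in per_pixel]


def segment(score_map):
    """Assign each pixel the index of its highest-scoring token."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_map]


# Toy input: 4 pixels, 2 content tokens, 2 attention maps.
# Token 0 has systematically higher scores than token 1 (score imbalance).
head_a = [[0.8, 0.1], [0.8, 0.1], [0.6, 0.3], [0.6, 0.3]]
head_b = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.4], [0.5, 0.4]]

global_map = aggregate([head_a, head_b], weights=[1.0, 1.0])
raw_labels = segment(global_map)       # [0, 0, 0, 0]: token 1 never wins
labels = segment(rescale(global_map))  # [0, 0, 1, 1]: token 1 recovers its region
print(raw_labels, labels)
```

In this toy case the raw global map assigns every pixel to the dominant token, while the rescaled map recovers the region where the weaker token is relatively strongest, which is the kind of score imbalance the rescaling step is meant to address.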

Figure 1 : (a) Previous training-free diffusion segmentors scale poorly with the generative power of diffusion models, which inspires our study to enable such scaling. (b) We have identified two gaps from individual cross-attention maps to semantic correlation, which have been preventing the aforeme

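Experimental Results

Research Questions

RQ1: Why do existing training-free diffusion segmentors fail to scale when stronger diffusion models are used?
RQ2: Can aggregated cross-attention maps be made to better reflect global semantic correlation and so yield reliable segmentation?
RQ3: Can auto aggregation and per-pixel rescaling let stronger diffusion models obtain better segmentation results?
RQ4: Does GoCA improve segmentation across standard benchmarks and strengthen integration with generative techniques?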

Main Results

Type	Method	VOC	Context	COCO-Object	Cityscapes	ADE20K
Non-DM	MaskCLIP	38.8	23.6	20.6	10.0	9.8
Non-DM	ReCO	25.1	19.9	15.7	19.3	11.2
Pre-Trained DM	DiffSegmentor	60.1	27.5	37.9	-	-
Pre-Trained DM	MaskDiffusion	29.9	-	-	17.1	-
Pre-Trained DM	FTTM 1	48.9	30.0	34.6	12.3	20.3
Vanilla	SD v1.5	44.3	32.3	32.3	11.8	18.0
Vanilla	SD XL	51.1	35.7	37.2	16.1	18.6
Vanilla	Pixart-Sigma	45.2	37.0	33.4	22.5	19.1
Vanilla	Flux	55.7	48.4	43.3	25.6	24.5
Baseline	SD v1.5	51.1	35.4	36.9	18.4	21.0
Ours	SD v1.5	60.7	40.4	39.2	16.1	22.0
Ours	SD XL	65.6	42.3	44.3	21.2	23.2
Ours	Pixart-Sigma	63.6	43.2	39.8	22.6	23.8
Ours	Flux	70.7	51.1	48.1	27.1	29.3
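
To make the comparison in the table easier to read, the per-benchmark gain of the "Ours" rows over the "Vanilla" rows can be computed directly. The snippet below is only a convenience sketch: the numbers are copied from the table above, and the variable names are illustrative.

```python
BENCHMARKS = ["VOC", "Context", "COCO-Object", "Cityscapes", "ADE20K"]

# Scores copied from the "Vanilla" and "Ours" rows of the table.
vanilla = {
    "SD v1.5":      [44.3, 32.3, 32.3, 11.8, 18.0],
    "SD XL":        [51.1, 35.7, 37.2, 16.1, 18.6],
    "Pixart-Sigma": [45.2, 37.0, 33.4, 22.5, 19.1],
    "Flux":         [55.7, 48.4, 43.3, 25.6, 24.5],
}
ours = {
    "SD v1.5":      [60.7, 40.4, 39.2, 16.1, 22.0],
    "SD XL":        [65.6, 42.3, 44.3, 21.2, 23.2],
    "Pixart-Sigma": [63.6, 43.2, 39.8, 22.6, 23.8],
    "Flux":         [70.7, 51.1, 48.1, 27.1, 29.3],
}


def mean(values):
    return sum(values) / len(values)


# Per-benchmark gain (Ours minus Vanilla) and mean score per backbone.
gains = {
    model: [round(o - v, 1) for o, v in zip(ours[model], vanilla[model])]
    for model in vanilla
}
mean_ours = {model: round(mean(scores), 2) for model, scores in ours.items()}
mean_vanilla = {model: round(mean(scores), 2) for model, scores in vanilla.items()}

for model in vanilla:
    per_benchmark = dict(zip(BENCHMARKS, gains[model]))
    print(model, per_benchmark, mean_vanilla[model], "->", mean_ours[model])
```

Every backbone gains on every benchmark relative to its Vanilla row, and the mean score with the proposed method is highest for Flux.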

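Stronger diffusion models (SD XL, PixArt-Sigma, Flux) benefit from GoCA-based aggregation and surpass SD v1.5 in segmentation performance.
GoCA (auto aggregation + per-pixel rescaling) outperforms the Vanilla and Baseline methods on the VOC, Context, COCO-Object, Cityscapes, and ADE20K benchmarks.
Layer-wise auto aggregation yields results comparable to manually tuned layer weights, and the full GoCA performs best.
The ablations show that both components of auto aggregation (head-wise and layer-wise) and per-pixel rescaling each contribute, with the combined GoCA giving the largest improvement.
Segmentation improved by GoCA raises the quality of generative techniques such as S-CFG, as reflected in better FID and CLIP scores across CFG strengths.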

Figure 2 : Attention maps in different heads and layers show a certain collaboration pattern, each focusing on distinct aspects of the image.

This review was generated by AI and reviewed by a human editor.