QUICK REVIEW

[Paper Review] Unsupervised Semantic Segmentation by Distilling Feature Correspondences

Mark F. Hamilton, Zhoutong Zhang|arXiv (Cornell University)|Mar 16, 2022

Multimodal Machine Learning Applications|Cited by 114

One-Sentence Summary

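STEGO distills feature correspondences from a pretrained self-supervised backbone into a compact, discrete segmentation head, achieving state-of-the-art unsupervised semantic segmentation on CocoStuff and Cityscapes without using any labels.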

ABSTRACT

Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. To solve this task, algorithms must produce features for every pixel that are both semantically meaningful and compact enough to form distinct clusters. Unlike previous works which achieve this with a single end-to-end framework, we propose to separate feature learning from cluster compactification. Empirically, we show that current unsupervised feature learning frameworks already generate dense features whose correlations are semantically consistent. This observation motivates us to design STEGO ($\textbf{S}$elf-supervised $\textbf{T}$ransformer with $\textbf{E}$nergy-based $\textbf{G}$raph $\textbf{O}$ptimization), a novel framework that distills unsupervised features into high-quality discrete semantic labels. At the core of STEGO is a novel contrastive loss function that encourages features to form compact clusters while preserving their relationships across the corpora. STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff ($\textbf{+14 mIoU}$) and Cityscapes ($\textbf{+9 mIoU}$) semantic segmentation challenges.

Research Motivation and Goals

Demonstrate that unsupervised deep features exhibit semantically consistent correlation patterns.
Introduce STEGO, a transformer-based architecture that distills feature correspondences into discrete segmentation labels.
Show that the distillation approach yields state-of-the-art unsupervised segmentation on CocoStuff and Cityscapes.
Provide ablations to justify design choices and training signals.

Proposed Method

Compute a dense feature correspondence tensor F between image feature maps using cosine similarities.
Define a segmentation correspondence tensor S, computed the same way from the segmentation head's outputs, and a correlation loss L_corr that aligns S with F through an element-wise product of (F - b) and S, where b is a scalar shift.
Clamp the segmentation correspondences S at zero (0-clamp) to stabilize learning, and apply spatial centering to F to improve small-object handling.
Train a lightweight segmentation head on a frozen backbone using self, KNN, and random image pairs, each with its own weight lambda and shift b: L = lambda_self L_corr(x, x, b_self) + lambda_knn L_corr(x, x_knn, b_knn) + lambda_rand L_corr(x, x_rand, b_rand).
Cluster the distilled features with minibatch K-means and refine with CRF post-processing to obtain final semantic maps.
Five-crop training and CRF refinement improve results and detail recovery.
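
The correspondence tensors and the three-term loss described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine_correspondence`, `corr_loss`, `stego_loss`) are hypothetical, the loss is averaged rather than summed over positions (which only rescales it), random arrays stand in for backbone and segmentation-head outputs, and the weights and shifts are placeholder values, not the paper's hyperparameters.

```python
import numpy as np


def cosine_correspondence(a, b, eps=1e-8):
    """Cosine similarity between every pixel of `a` and every pixel of `b`.

    a: (C, H, W), b: (C, I, J) -> result: (H, W, I, J)
    """
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return np.einsum("chw,cij->hwij", a, b)


def corr_loss(feats_x, feats_y, segs_x, segs_y, shift):
    """Correlation loss between an image pair, with spatial centering and 0-clamp.

    feats_*: frozen backbone features; segs_*: segmentation head outputs.
    """
    F = cosine_correspondence(feats_x, feats_y)
    S = cosine_correspondence(segs_x, segs_y)
    # Spatial centering: for each source pixel (h, w), subtract the mean
    # correspondence over all target positions (i, j).
    F_centered = F - F.mean(axis=(2, 3), keepdims=True)
    # 0-clamp: negative segmentation correspondences contribute nothing.
    S_clamped = np.maximum(S, 0.0)
    # Assumption: mean instead of sum over positions.
    return -np.mean((F_centered - shift) * S_clamped)


def stego_loss(x, x_knn, x_rand, lambdas, shifts):
    """Weighted sum of the self, KNN, and random-pair correlation losses.

    Each of x, x_knn, x_rand is a (feats, segs) tuple.
    """
    pairs = {"self": x, "knn": x_knn, "rand": x_rand}
    return sum(
        lambdas[k] * corr_loss(x[0], y[0], x[1], y[1], shifts[k])
        for k, y in pairs.items()
    )


# Usage with random stand-in data (16-dim features, 6-dim segmentation codes).
_rng = np.random.default_rng(0)


def _fake_image():
    return _rng.normal(size=(16, 8, 8)), _rng.normal(size=(6, 8, 8))


# Placeholder hyperparameters, chosen only for illustration.
example_lambdas = {"self": 1.0, "knn": 1.0, "rand": 1.0}
example_shifts = {"self": 0.3, "knn": 0.2, "rand": 0.1}
example_loss = stego_loss(
    _fake_image(), _fake_image(), _fake_image(), example_lambdas, example_shifts
)
print(float(example_loss))
```

In an actual training loop only the segmentation head would receive gradients; the backbone features feeding F stay fixed, which is why the sketch treats them as plain arrays.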

Experimental Results

Research Questions

RQ1 Do unsupervised features exhibit correlation patterns that align with semantic labels across images?
RQ2 Can a lightweight segmentation head distill these feature correspondences into discrete, cluster-friendly representations?
RQ3 How does STEGO perform on standard unsupervised semantic segmentation benchmarks compared to prior methods?
RQ4 Which architectural and training choices most significantly impact performance (ablation findings)?

Key Findings

STEGO achieves state-of-the-art unsupervised segmentation on CocoStuff, with +14 mIoU over prior art.
STEGO achieves state-of-the-art unsupervised segmentation on Cityscapes, with +9 mIoU over prior art.
On CocoStuff, STEGO reports unsupervised Acc 56.9 and mIoU 28.2; linear-probe Acc 76.1 and mIoU 41.0.
On Cityscapes, STEGO reports unsupervised Acc 73.2 and mIoU 21.0.
Compared to PiCIE and other baselines, STEGO yields stronger clustering quality and finer object detail, aided by five-crop training and CRF post-processing.
Ablations show that 0-clamp, spatial centering (SC), five-crop, and CRF all contribute to performance gains.
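
For readers unfamiliar with the two metrics reported above, the sketch below shows how pixel accuracy (Acc) and mean intersection-over-union (mIoU) are commonly computed from a confusion matrix. It is a generic illustration, not the paper's evaluation code: the helper names are hypothetical, and it assumes predicted cluster IDs have already been mapped to ground-truth class IDs (in unsupervised evaluation this mapping is typically found with Hungarian matching).

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def accuracy_and_miou(conf):
    """Pixel accuracy and mean IoU over classes that appear at least once."""
    true_pos = np.diag(conf)
    acc = true_pos.sum() / conf.sum()
    # Union = predicted pixels + ground-truth pixels - intersection.
    union = conf.sum(axis=0) + conf.sum(axis=1) - true_pos
    valid = union > 0
    miou = (true_pos[valid] / union[valid]).mean()
    return acc, miou


# Toy example: four pixels, two classes, one pixel mislabeled.
toy_target = np.array([0, 0, 1, 1])
toy_pred = np.array([0, 1, 1, 1])
toy_acc, toy_miou = accuracy_and_miou(confusion_matrix(toy_pred, toy_target, 2))
print(toy_acc, toy_miou)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so accuracy is 0.75 while mIoU is about 0.583; mIoU penalizes errors on small classes more than accuracy does.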

This review was generated by AI and checked by human editors.