QUICK REVIEW

[Paper Review] Token Contrast for Weakly-Supervised Semantic Segmentation

Lixiang Ru, Heliang Zheng|arXiv (Cornell University)|Mar 2, 2023

Advanced Neural Network Applications | Cited by 9

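One-Sentence Summary

This paper proposes Token Contrast (ToCo) for WSSS with Vision Transformers. It uses (1) Patch Token Contrast (PTC) to align the final patch tokens with the semantics retained in intermediate layers and (2) Class Token Contrast (CTC) to enforce local-global representation consistency between uncertain regions and whole objects, achieving strong single-stage WSSS results on VOC and COCO.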

ABSTRACT

Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe it also brings the over-smoothing issue, i.e., the final patch tokens incline to be uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we designed a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devised a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates the representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other single-stage competitors and achieve comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.

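Motivation and Objectives

Overcome the limitations of CAM in WSSS by using ViT to capture whole object regions.
Mitigate ViT's over-smoothing problem by supervising the final patch tokens with semantic signals from intermediate layers (PTC).
Differentiate uncertain CAM regions through a class-token-based local-global contrast (CTC).
Develop a single-stage WSSS framework whose performance is competitive with multi-stage methods.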

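Proposed Method

Introduces Patch Token Contrast (PTC), which supervises the final patch tokens with pseudo token relations derived from intermediate ViT layers.
Derives an auxiliary CAM from an intermediate layer through an auxiliary classifier and uses it to form reliable token labels for PTC.
Defines Class Token Contrast (CTC), which aligns the global class token with the class tokens of local crops taken from uncertain regions and contrasts it against background crops using an InfoNCE loss.
Combines L_cls, L_cls^m, L_ptc, and L_ctc with the segmentation loss L_seg into an end-to-end ToCo training objective.
Updates the global projection head used in CTC with an exponential moving average (EMA) to stabilize local-global token alignment.
Integrates ToCo into a single-stage WSSS framework, combining PAR refinement with a simple decoder for pixel-level prediction.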
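
The CTC and EMA items above can be illustrated with a small sketch. This is a generic InfoNCE formulation in plain Python and is not taken from the ToCo code base. The function names `info_nce` and `ema_update`, the temperature `tau=0.5`, the momentum `0.9`, and the choice to feed the EMA from the local head's parameters are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(global_tok, positives, negatives, tau=0.5):
    """Generic InfoNCE loss (tau is an assumed temperature).

    global_tok: projected class token of the whole image.
    positives:  projected class tokens of crops from uncertain regions.
    negatives:  projected class tokens of background crops.
    """
    if not positives:
        return 0.0
    q = normalize(global_tok)
    # Shared denominator term contributed by all negative (background) crops.
    neg_sum = sum(math.exp(dot(q, normalize(n)) / tau) for n in negatives)
    losses = []
    for p in positives:
        pos = math.exp(dot(q, normalize(p)) / tau)
        losses.append(-math.log(pos / (pos + neg_sum)))
    return sum(losses) / len(losses)


def ema_update(global_params, local_params, momentum=0.9):
    """EMA step: move the global head's parameters toward the local head's.

    Assumption: the local head is the EMA source; momentum is hypothetical.
    """
    return [momentum * g + (1.0 - momentum) * l
            for g, l in zip(global_params, local_params)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    print("aligned   :", info_nce(q, [[1.0, 0.0]], [[-1.0, 0.0]]))
    print("misaligned:", info_nce(q, [[-1.0, 0.0]], [[1.0, 0.0]]))
    print("ema step  :", ema_update([1.0, 0.0], [0.0, 1.0]))
```

The loss is small when the uncertain-region crops agree with the global class token and the background crops do not, and large in the opposite case, which is the behaviour the CTC description relies on.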

Figure 1: The generated CAM and the pairwise cosine similarity of patch tokens (sim. map). Our method can address the over-smoothing issue well and produce accurate CAM. Here we use ViT-Base.
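
The pairwise cosine similarity shown in Figure 1 is also the quantity that PTC acts on. The following is a minimal sketch of that idea, assuming token labels are obtained by thresholding an auxiliary CAM and that same-label pairs are pulled together while different-label pairs are pushed apart. The thresholds `low` and `high`, the `IGNORE` marker, and the function names are hypothetical, and the loss form is an illustration rather than the paper's exact equation.

```python
import math

IGNORE = -1  # hypothetical marker for tokens with an uncertain auxiliary CAM score


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def token_labels(aux_cam, low=0.25, high=0.7):
    """Turn per-token class scores in [0, 1] into pseudo token labels.

    Returns class index + 1 for confident foreground, 0 for confident
    background, and IGNORE otherwise. Thresholds are assumed values.
    """
    labels = []
    for scores in aux_cam:
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= high:
            labels.append(best + 1)
        elif scores[best] <= low:
            labels.append(0)
        else:
            labels.append(IGNORE)
    return labels


def ptc_loss(final_tokens, labels):
    """Contrast final patch tokens using pseudo relations between token labels."""
    pos, neg = [], []
    n = len(final_tokens)
    for i in range(n):
        for j in range(i + 1, n):
            if IGNORE in (labels[i], labels[j]):
                continue  # skip pairs that involve an uncertain token
            sim = cosine(final_tokens[i], final_tokens[j])
            (pos if labels[i] == labels[j] else neg).append(sim)
    # Same-label pairs should be similar, different-label pairs dissimilar.
    pos_term = sum(1.0 - s for s in pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return pos_term + neg_term


if __name__ == "__main__":
    labels = token_labels([[0.9], [0.8], [0.1], [0.5]])
    separated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    uniform = [[1.0, 1.0]] * 4  # over-smoothed tokens
    print(labels, ptc_loss(separated, labels), ptc_loss(uniform, labels))
```

In this toy setup, over-smoothed (uniform) tokens receive a higher loss than tokens that separate foreground from background, so minimizing the loss works against over-smoothing.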

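Experimental Results

Research Questions

RQ1: Can intermediate ViT representations provide the semantic diversity needed to counter over-smoothing of the final patch tokens?
RQ2: Does supervising the final patch tokens with an intermediate-layer CAM improve CAM quality and the WSSS pseudo labels?
RQ3: Does class-token-based contrast between global and local views improve the activation of uncertain regions in the CAM?
RQ4: How does ToCo perform on VOC and COCO compared with single-stage and multi-stage WSSS methods based on image-level labels?

Key Findings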

Method	Backbone	VOC val (mIoU %)	VOC test (mIoU %)	COCO val (mIoU %)
ToCo	ViT-B	69.8	70.5	41.3
ToCo†	ViT-B†	71.1	72.2	42.3

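ToCo markedly improves CAM quality and downstream segmentation performance over the ViT baseline.
PTC effectively reduces over-smoothing of the patch tokens, raising the final CAM mIoU on VOC val from 27.9% to 62.5% in the ablations.
CTC further improves CAM quality by 4.7% mIoU and introduces a local-global consistency reminiscent of semi-supervised learning.
On VOC, ToCo with ViT-B reaches 69.8% mIoU on val and 70.5% on test; with ViT-B†, which differs in its pretrained weights, it reaches 71.1% on val and 72.2% on test.
On COCO val, ToCo reaches 41.3% mIoU with ViT-B and 42.3% with ViT-B†, using only image-level supervision.
ToCo's single-stage results surpass many single-stage competitors and approach or match multi-stage methods that use image-level labels.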
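
All of the numbers above are mean Intersection-over-Union (mIoU) scores. As a reminder of what the metric measures, here is a minimal sketch over flattened label lists. The function name `mean_iou` and the choice to skip classes that are absent from both prediction and ground truth are assumptions; benchmark evaluation code typically accumulates a confusion matrix over the whole dataset instead.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of class indices.

    Classes that appear in neither list are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: four pixels, two classes (0 = background, 1 = object).
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(mean_iou(pred, target, num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.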

Figure 2: The average pairwise cosine similarity of patch tokens in each Transformer block. The cosine similarity is computed on the VOC train set. Here we use the ViT-Base (ViT-B) [12] architecture which includes 12 Transformer blocks.
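
The statistic plotted in Figure 2 can be sketched as follows. This is an illustrative computation on toy vectors, assuming the average is taken over all distinct token pairs within one block; the function name `avg_pairwise_cosine` and the toy token values are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def avg_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of patch tokens."""
    sims = [cosine(u, v) for u, v in combinations(tokens, 2)]
    return sum(sims) / len(sims) if sims else 0.0


if __name__ == "__main__":
    # Toy patch tokens standing in for the output of two Transformer blocks.
    blocks = {
        "intermediate block": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        "final block": [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]],
    }
    for name, tokens in blocks.items():
        print(name, round(avg_pairwise_cosine(tokens), 3))
```

A value close to 1 means the tokens in that block are nearly uniform, which is the over-smoothing symptom the figure tracks across blocks.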

This review was generated by AI and reviewed by human editors.