QUICK REVIEW

[Paper Review] WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

Lianghui Zhu, Yingyue Li|arXiv (Cornell University)|Apr 3, 2023

Advanced Neural Network Applications | Cited by 16

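ONE-SENTENCE SUMMARY

WeakTr uses a plain Vision Transformer with adaptive attention fusion to generate high-quality CAMs end to end, and adopts a gradient clipping decoder for online retraining, achieving state-of-the-art WSSS results on VOC 2012 and COCO 2014.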

ABSTRACT

Transformer has been very successful in various computer vision tasks and understanding the working mechanism of transformer is important. As touchstones, weakly-supervised semantic segmentation (WSSS) and class activation map (CAM) are useful tasks for analyzing vision transformers (ViT). Based on the plain ViT pre-trained with ImageNet classification, we find that multi-layer, multi-head self-attention maps can provide rich and diverse information for weakly-supervised semantic segmentation and CAM generation, e.g., different attention heads of ViT focus on different image areas and object categories. Thus we propose a novel method to end-to-end estimate the importance of attention heads, where the self-attention maps are adaptively fused for high-quality CAM results that tend to have more complete objects. Besides, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results efficiently and effectively. Furthermore, the gradient clipping decoder can make good use of the knowledge in large-scale pre-trained ViT and has a scalable ability. The proposed plain Transformer-based Weakly-supervised learning method (WeakTr) obtains the superior WSSS performance on standard benchmarks, i.e., 78.5% mIoU on the val set of PASCAL VOC 2012 and 51.1% mIoU on the val set of COCO 2014. Source code and checkpoints are available at https://github.com/hustvl/WeakTr.

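MOTIVATION AND OBJECTIVES

Motivation: improve CAM quality for WSSS with a plain ViT, without relying on convolutional inductive biases.
Propose an adaptive attention fusion module that weights the ViT attention heads to generate better CAMs.
Introduce an end-to-end CAM training strategy that optimizes CAM quality through classification signals.
Develop an online retraining method with a gradient clipping decoder to update the segmentation model efficiently.
Demonstrate state-of-the-art WSSS performance on the VOC 2012 and COCO 2014 benchmarks.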

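PROPOSED METHOD

Use a plain ViT backbone, with C class tokens and N^2 patch tokens as input to the transformer encoder.
Generate a coarse CAM by applying a convolution to the patch tokens, then refine it through adaptive attention fusion of the self-attention maps.
Compute dynamic head weights W by pooling the attention maps and passing the result through an FFN to obtain W', then weight the cross-attention and patch-attention maps to form CAM_fine.
Train end to end with the joint loss L = L_Fine-CAM + L_CLS-token + L_Coarse-CAM, which supervises the head weighting.
Introduce a gradient clipping decoder for online retraining, which constrains gradient flow based on global and local gradient statistics when updating the segmentation network.
At inference, apply a CRF to refine the segmentation maps.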
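
The pooling, FFN, and head-weighting steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name adaptive_attention_fusion, the FFN weights w1 and w2, the ReLU and sigmoid activations, the averaging of the two pooled statistics, and the order in which the fused maps are applied to the coarse CAM are all hypothetical choices made for readability. Here KH stands for the total number of attention heads across layers and P for the number of patch tokens.

```python
import numpy as np


def adaptive_attention_fusion(cross_attn, patch_attn, cam_coarse, w1, w2):
    """Fuse per-head attention maps into a refined CAM (illustrative sketch).

    cross_attn: (KH, C, P) class-token-to-patch attention, one map per head
    patch_attn: (KH, P, P) patch-to-patch attention, one map per head
    cam_coarse: (C, P) coarse CAM computed from the patch tokens
    w1, w2:     hypothetical FFN weights, shapes (KH, hidden) and (hidden, KH)
    """
    num_heads = cross_attn.shape[0]

    # Global average pooling: one scalar descriptor per attention head.
    pooled = 0.5 * (cross_attn.mean(axis=(1, 2)) + patch_attn.mean(axis=(1, 2)))

    # A small FFN (ReLU, then sigmoid) maps the descriptors to head weights
    # in (0, 1). The activations are assumptions, not taken from the paper.
    hidden = np.maximum(pooled @ w1, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))

    # Weighted average of the attention maps over all heads.
    fused_cross = np.tensordot(weights, cross_attn, axes=1) / num_heads  # (C, P)
    fused_patch = np.tensordot(weights, patch_attn, axes=1) / num_heads  # (P, P)

    # The fused cross-attention modulates the coarse CAM, and the fused
    # patch attention then propagates activations between patches.
    cam_cross = fused_cross * cam_coarse
    cam_fine = cam_cross @ fused_patch.T  # (C, P)
    return cam_fine, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kh, c, p, hid = 6, 3, 16, 4
    cam, w = adaptive_attention_fusion(
        rng.random((kh, c, p)),
        rng.random((kh, p, p)),
        rng.random((c, p)),
        rng.standard_normal((kh, hid)),
        rng.standard_normal((hid, kh)),
    )
    print(cam.shape, w.shape)
```

With all head weights equal, this sketch reduces to plain mean aggregation of the attention maps, which is the baseline the review compares AAF against.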

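EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can the self-attention maps of a plain Vision Transformer be adaptively fused to generate higher-quality CAMs for WSSS?
RQ2: Without a CAM refinement stage, can end-to-end CAM training with adaptive head weighting improve pseudo-label quality?
RQ3: Compared with traditional CAM refinement pipelines, can online retraining with a gradient clipping decoder improve the efficiency and accuracy of WSSS?
RQ4: How does WeakTr with a ViT backbone perform on standard WSSS benchmarks (VOC 2012 and COCO 2014)?
RQ5: How does WeakTr compare with state-of-the-art WSSS methods in CAM quality and final segmentation mIoU?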

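KEY FINDINGS

WeakTr achieves state-of-the-art WSSS results on the VOC 2012 val and COCO 2014 val benchmarks.
On VOC 2012, WeakTr with ViT-S reaches 78.4% mIoU on the val set and 79.0% on the test set, surpassing previous methods.
On VOC 2012 train, the improved CAM (Fine-CAM) surpasses several prior CAM methods (e.g., MCTformer, ViT-PCM).
Online retraining with the gradient clipping decoder yields substantial training-time savings (roughly 2.6x faster overall) while maintaining high mIoU.
Adaptive attention fusion (AAF) achieves higher CAM precision/recall and mIoU than mean-sum aggregation, especially with CRF post-processing.
Ablation studies show that the gradient patch size and the clipping start threshold affect final performance, and that the proposed decoder yields meaningful gains.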
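
The two ablated hyperparameters, gradient patch size and clipping start threshold, suggest a patch-wise clipping rule. The sketch below shows one plausible reading, not the authors' implementation: it uses a per-pixel loss map as a stand-in for the gradient map, switches clipping on only once the global mean falls below the start threshold, and drops patches whose local mean exceeds the global mean, on the assumption that unusually large values come from noisy CAM pseudo-labels. The function name patch_clipped_loss and its parameters are hypothetical.

```python
import numpy as np


def patch_clipped_loss(loss_map, patch_size, start_threshold):
    """Average a per-pixel loss while ignoring suspiciously high-loss patches.

    loss_map:        (H, W) per-pixel loss, H and W divisible by patch_size
    patch_size:      side length of the square patches (assumed meaning of
                     the "gradient patch size" hyperparameter)
    start_threshold: clipping is active only when the global mean is below
                     this value (assumed meaning of the "start threshold")
    """
    h, w = loss_map.shape
    ph, pw = h // patch_size, w // patch_size

    # Local statistic: mean loss inside each patch. Global: mean over the map.
    patches = loss_map.reshape(ph, patch_size, pw, patch_size)
    local_mean = patches.mean(axis=(1, 3))  # (ph, pw)
    global_mean = loss_map.mean()

    if global_mean >= start_threshold:
        # Early in training the loss is large everywhere: keep every patch.
        keep = np.ones_like(local_mean, dtype=bool)
    else:
        # Later on, drop patches whose local mean exceeds the global mean.
        keep = local_mean <= global_mean

    # Expand the patch-level decision back to pixel resolution.
    mask = np.repeat(np.repeat(keep, patch_size, axis=0), patch_size, axis=1)
    clipped = (loss_map * mask).sum() / max(mask.sum(), 1)
    return clipped, mask


if __name__ == "__main__":
    demo = np.full((4, 4), 0.1)
    demo[:2, :2] = 1.0  # one patch with unusually high loss
    loss, mask = patch_clipped_loss(demo, patch_size=2, start_threshold=1.0)
    print(loss, int(mask.sum()))
```

In an autograd framework, masking the loss in this way zeroes the gradient contribution of the dropped patches, which is the sense in which the sketch constrains gradient flow.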

This review was generated by AI and reviewed by human editors.