QUICK REVIEW

[Paper Review] Self-Supervised Visual Representation Learning with Semantic Grouping

Xin Wen, Bingchen Zhao|arXiv (Cornell University)|May 30, 2022

Domain Adaptation and Few-Shot Learning | Cited by 25

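One-Sentence Summary

SlotCon learns object/group-level representations from scene-centric images by combining data-driven semantic grouping over learnable prototypes with slot-level contrastive learning, which improves downstream detection, segmentation, and unsupervised semantic segmentation.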

ABSTRACT

In this paper, we tackle the problem of learning visual representations from unlabeled scene-centric data. Existing works have demonstrated the potential of utilizing the underlying complex structure within scene-centric data; still, they commonly rely on hand-crafted objectness priors or specialized pretext tasks to build a learning framework, which may harm generalizability. Instead, we propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning. The semantic grouping is performed by assigning pixels to a set of learnable prototypes, which can adapt to each sample by attentive pooling over the feature and form new slots. Based on the learned data-dependent slots, a contrastive objective is employed for representation learning, which enhances the discriminability of features, and conversely facilitates grouping semantically coherent pixels together. Compared with previous efforts, by simultaneously optimizing the two coupled objectives of semantic grouping and contrastive learning, our approach bypasses the disadvantages of hand-crafted priors and is able to learn object/group-level representations from scene-centric images. Experiments show our approach effectively decomposes complex scenes into semantic groups for feature learning and significantly benefits downstream tasks, including object detection, instance segmentation, and semantic segmentation. Code is available at: https://github.com/CVMI-Lab/SlotCon.

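Motivation and Objectives

Learn visual representations from unlabeled scene-centric data without relying on hand-crafted objectness priors.
Propose a fully data-driven framework that jointly discovers semantic groups (slots) and learns discriminative representations.
Transfer to downstream tasks such as object detection, instance segmentation, and semantic segmentation.
Show that semantic grouping improves robustness and generalization on real-world scene data.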

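Proposed Method

Introduces SlotCon, built from two networks (a student and a teacher), each producing pixel embeddings and learning K prototypes (semantic centers).
Performs deep clustering at the pixel level: a softmax over the similarity between normalized projections and prototypes assigns each pixel to the prototypes, giving per-pixel group assignments.
Uses inverse augmentation to resolve the spatial misalignment between views, and enforces cross-view grouping consistency with a cross-entropy loss (Group loss).
Maintains a running mean of the logits, c, to avoid collapse, and uses a lower temperature for the teacher than for the student (tau_t < tau_s).
Extracts group-level slots by attentive pooling of the projections under the assignments, producing K group vectors (slots).
Applies an InfoNCE-based slot-level contrastive loss that aligns the same slot across views and separates different slots, with masking to ignore non-dominant slots (Slot loss).
Combines the two into the overall objective L = lambda_g * Group + (1 - lambda_g) * Slot, optimizes the student on it, and updates the teacher parameters as an exponential moving average (EMA) of the student.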
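The steps above can be sketched in a few NumPy functions. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, toy sizes, temperatures, and momentum are all hypothetical; the two views are assumed to be spatially aligned already (the inverse-augmentation step is skipped); the running center c is left at zero; and the slot loss uses only the K slots of one image pair as negatives.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def assign(z, prototypes, tau, center=0.0):
    """Soft pixel-to-prototype assignment. z: (N, D), prototypes: (K, D) -> (N, K)."""
    logits = l2_normalize(z) @ l2_normalize(prototypes).T  # cosine similarity
    return softmax((logits - center) / tau, axis=-1)

def group_loss(p_teacher, q_student, eps=1e-8):
    """Cross-entropy between teacher and student assignments, averaged over pixels."""
    return float(-(p_teacher * np.log(q_student + eps)).sum(axis=-1).mean())

def pool_slots(z, assignment, eps=1e-8):
    """Attentive pooling: assignment-weighted mean of pixel features -> (K, D) slots."""
    weights = assignment / (assignment.sum(axis=0, keepdims=True) + eps)
    return weights.T @ z

def slot_mask(assignment):
    """True for slots that are the top assignment of at least one pixel."""
    k = assignment.shape[1]
    return np.bincount(assignment.argmax(axis=-1), minlength=k) > 0

def slot_loss(slots_student, slots_teacher, mask, tau=0.2, eps=1e-8):
    """InfoNCE over slots: slot k in one view should match slot k in the other."""
    logits = l2_normalize(slots_student) @ l2_normalize(slots_teacher).T / tau
    per_slot = -np.diag(np.log(softmax(logits, axis=-1) + eps))
    return float((per_slot * mask).sum() / max(mask.sum(), 1))

def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average of student parameters into the teacher."""
    return momentum * teacher + (1.0 - momentum) * student

# Toy demo on random "pixel" features; every value here is illustrative.
rng = np.random.default_rng(0)
N, D, K = 64, 16, 4
z_student = rng.normal(size=(N, D))
z_teacher = z_student + 0.05 * rng.normal(size=(N, D))  # stand-in for the aligned second view
protos_student = rng.normal(size=(K, D))
protos_teacher = protos_student.copy()

q = assign(z_student, protos_student, tau=0.1)                        # student, tau_s
p = assign(z_teacher, protos_teacher, tau=0.07, center=np.zeros(K))   # teacher, tau_t < tau_s
l_group = group_loss(p, q)
l_slot = slot_loss(pool_slots(z_student, q), pool_slots(z_teacher, p), slot_mask(p))

lambda_g = 0.5
total = lambda_g * l_group + (1 - lambda_g) * l_slot
protos_teacher = ema_update(protos_teacher, protos_student)
print(f"group={l_group:.3f} slot={l_slot:.3f} total={total:.3f}")
```

In a real training loop the student would be updated by gradient descent on `total` while the teacher receives no gradients and only follows the student through `ema_update`.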

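Experimental Results

Research Questions

RQ1: Can semantic grouping be learned end to end, in a data-driven way, from scene-centric data without hand-crafted objectness priors?
RQ2: Does combining semantic grouping with slot-level contrastive learning improve object/group-level representations and their transfer to downstream tasks?
RQ3: How do the number of prototypes and the balance between the group and slot losses affect downstream performance?
RQ4: How well does the model discover semantic groups in unlabeled real-world scenes (such as COCO-Stuff) compared with existing unsupervised methods?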

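Key Findings

Pretrained on either COCO or ImageNet-1K, SlotCon shows strong transfer performance on COCO object detection and instance segmentation and on semantic segmentation for Cityscapes, VOC, and ADE20K.
With COCO pretraining, SlotCon reports AP^b = 41.0, AP_50^b = 61.1, AP_75^b = 45.0, AP^m = 37.0, AP_50^m = 58.3, AP_75^m = 39.8 on COCO detection and instance segmentation, and City = 76.2, VOC = 71.6, ADE = 39.0 on semantic segmentation.
With COCO pretraining, SlotCon outperforms previous object/group-level SSL methods across tasks and, without objectness priors, narrows the gap to object-centric pretraining.
Unsupervised semantic segmentation on COCO-Stuff reaches mIoU = 18.26 and pAcc = 42.36, exceeding several prior methods on these metrics.
Ablations indicate that a balanced weighting of the group and slot losses (lambda_g ≈ 0.5) and a suitable number of prototypes (for example, K = 256 for COCO) benefit performance and transfer.
SlotCon demonstrates that semantic grouping and group-level contrastive learning are complementary, making it possible to obtain object-oriented representations from scene-centric data.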
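For readers unfamiliar with the mIoU and pAcc figures quoted above, the following is a minimal sketch of how both metrics are commonly computed from a confusion matrix. It assumes predictions are already expressed in the ground-truth label space; in unsupervised evaluation, predicted clusters are typically matched to ground-truth classes first (for example with Hungarian matching), and that step is omitted here. The function names are hypothetical and are not taken from the SlotCon code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    return cm.trace() / cm.sum()

def mean_iou(cm):
    """Mean over classes of intersection / union, skipping classes that never occur."""
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0
    return (inter[valid] / union[valid]).mean()

# Tiny example: four pixels, two classes.
cm = confusion_matrix(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(pixel_accuracy(cm), mean_iou(cm))
```

Papers usually report both values as percentages, so the fractions returned here would be multiplied by 100.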


This review was generated by AI and checked by a human editor.