QUICK REVIEW

[Paper Review] Self-Supervised Visual Representation Learning with Semantic Grouping

Xin Wen, Bingchen Zhao|arXiv (Cornell University)|May 30, 2022

Domain Adaptation and Few-Shot Learning | Cited by 25

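One-Sentence Summary

SlotCon jointly performs data-driven semantic grouping with learnable prototypes and slot-level contrastive learning. This lets it learn object/group-level representations from scene-centric images and improves downstream detection, segmentation, and unsupervised semantic segmentation.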

ABSTRACT

In this paper, we tackle the problem of learning visual representations from unlabeled scene-centric data. Existing works have demonstrated the potential of utilizing the underlying complex structure within scene-centric data; still, they commonly rely on hand-crafted objectness priors or specialized pretext tasks to build a learning framework, which may harm generalizability. Instead, we propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning. The semantic grouping is performed by assigning pixels to a set of learnable prototypes, which can adapt to each sample by attentive pooling over the feature and form new slots. Based on the learned data-dependent slots, a contrastive objective is employed for representation learning, which enhances the discriminability of features, and conversely facilitates grouping semantically coherent pixels together. Compared with previous efforts, by simultaneously optimizing the two coupled objectives of semantic grouping and contrastive learning, our approach bypasses the disadvantages of hand-crafted priors and is able to learn object/group-level representations from scene-centric images. Experiments show our approach effectively decomposes complex scenes into semantic groups for feature learning and significantly benefits downstream tasks, including object detection, instance segmentation, and semantic segmentation. Code is available at: https://github.com/CVMI-Lab/SlotCon.

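Motivation and Objectives

The work is motivated by the problem of learning visual representations from unlabeled scene-centric data.
Instead of relying on hand-crafted objectness priors, it proposes a fully data-driven framework that discovers semantic groups (slots) and learns discriminative representations at the same time.
It aims to enable transfer to downstream tasks such as object detection, instance segmentation, and semantic segmentation.
It aims to show that semantic grouping improves robustness and generalization on real-world scene data.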

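Proposed Method

Introduce two networks, a student and a teacher, that produce pixel-level embeddings, and learn K prototypes (semantic centers).
Assign pixels to prototypes by applying a softmax over the similarities between normalized projections and the prototypes, which yields a per-pixel group assignment.
Align the two views by inverting the augmentations to handle spatial misalignment between them, then enforce cross-view grouping consistency with a cross-entropy loss (Group loss).
To prevent collapse, maintain a running mean logit c for centering and use different teacher and student temperatures (tau_t < tau_s).
Extract group-level slots by attentive pooling of the projections weighted by the assignments, producing K group vectors (slots).
Apply a Slot loss, a slot-level contrastive loss based on InfoNCE with masking that ignores non-dominant slots, to align matching slots across views and discriminate between different slots.
Combine the two losses into the overall objective L = lambda_g * Group + (1 - lambda_g) * Slot, and update the teacher parameters as an exponential moving average (EMA) of the student, i.e. a momentum teacher.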
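
The steps listed above can be sketched in NumPy for a single image pair. This is a minimal illustration under stated assumptions, not the authors' implementation: the two views are assumed to be already spatially aligned (row i of each view is the same location), the predictor head and batch-level negatives are omitted, and all function names, shapes, and hyperparameter values (temperatures, momentum, slot temperature 0.2) are hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def assign(z, prototypes, tau, center=0.0):
    """Soft pixel-to-prototype assignment. z: (N, D), prototypes: (K, D) -> (N, K)."""
    logits = l2_normalize(z) @ l2_normalize(prototypes).T  # cosine similarity
    return softmax((logits - center) / tau)

def slots_from(z, assignment):
    """Attentive pooling: each slot is an assignment-weighted mean of pixel features."""
    weights = assignment / (assignment.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ z  # (K, D)

def group_loss(p_teacher, q_student):
    """Cross-entropy between teacher and student assignments of aligned pixels."""
    return float(-(p_teacher * np.log(q_student + 1e-8)).sum(axis=-1).mean())

def dominant_mask(assignment):
    """True for slots that are the argmax assignment of at least one pixel."""
    k = assignment.shape[1]
    return np.bincount(assignment.argmax(axis=1), minlength=k) > 0

def slot_loss(s_student, s_teacher, mask, tau=0.2):
    """InfoNCE over slots: slot k of one view should match slot k of the other."""
    sim = l2_normalize(s_student) @ l2_normalize(s_teacher).T / tau  # (K, K)
    log_prob = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    return float(-(np.diag(log_prob) * mask).sum() / max(mask.sum(), 1))

# Toy data: N aligned pixels, D-dim projections, K prototypes (illustrative sizes).
rng = np.random.default_rng(0)
N, D, K = 16, 8, 4
tau_s, tau_t, lambda_g, momentum = 0.1, 0.07, 0.5, 0.99  # assumed values

protos_s = rng.normal(size=(K, D))   # student prototypes
protos_t = protos_s.copy()           # teacher prototypes (EMA copy)
center = np.zeros(K)                 # running mean logit c, applied to the teacher

z_s = [rng.normal(size=(N, D)) for _ in range(2)]            # student, views 1 and 2
z_t = [z + 0.1 * rng.normal(size=(N, D)) for z in z_s]       # stand-in teacher outputs

q = [assign(z, protos_s, tau_s) for z in z_s]                # student assignments
p = [assign(z, protos_t, tau_t, center) for z in z_t]        # teacher assignments

# Group loss: the student on one view predicts the teacher's grouping of the other.
l_group = 0.5 * (group_loss(p[1], q[0]) + group_loss(p[0], q[1]))

# Slot loss: contrast pooled slots across views, ignoring non-dominant slots.
mask = dominant_mask(p[0]) & dominant_mask(p[1])
s_s = [slots_from(z, a) for z, a in zip(z_s, q)]
s_t = [slots_from(z, a) for z, a in zip(z_t, p)]
l_slot = 0.5 * (slot_loss(s_s[0], s_t[1], mask) + slot_loss(s_s[1], s_t[0], mask))

loss = lambda_g * l_group + (1 - lambda_g) * l_slot

# Momentum (EMA) update of the teacher; shown here for the prototypes only.
protos_t = momentum * protos_t + (1 - momentum) * protos_s
```

In a real training loop the student would be updated by backpropagating through `loss`, and the center c would itself be tracked as a moving average of the teacher logits; both are left out here to keep the sketch short.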

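Experimental Results

Research Questions

RQ1: Can semantic grouping be learned from scene-centric data in a data-driven, end-to-end manner, without hand-crafted objectness priors?
RQ2: Does coupling semantic grouping with slot-level contrastive learning improve object/group-level representations and facilitate transfer to downstream tasks?
RQ3: How do the number of prototypes and the balance between the grouping loss and the slot loss affect downstream performance?
RQ4: How well does the model discover semantic groups in unlabeled real-world scenes such as COCO-Stuff, compared with prior unsupervised methods?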

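Key Findings

When pretrained on COCO or ImageNet-1K, SlotCon shows strong transfer performance on COCO object detection/segmentation and on Cityscapes, VOC, and ADE20K semantic segmentation.
With COCO pretraining, SlotCon reports AP^b = 41.0, AP_50^b = 61.1, AP_75^b = 45.0, AP^m = 37.0, AP_50^m = 58.3, AP_75^m = 39.8 (COCO detection/segmentation), and Cityscapes = 76.2, VOC = 71.6, ADE20K = 39.0 for semantic segmentation.
With COCO pretraining, SlotCon outperforms prior object/group-level SSL methods and approaches object-centric pretraining, without requiring objectness priors.
Unsupervised semantic segmentation on COCO-Stuff reaches mIoU = 18.26 and pAcc = 42.36, surpassing several prior methods on these metrics.
Ablation studies indicate that balancing the grouping and slot losses (λ_g ≈ 0.5) and choosing a suitable number of prototypes (e.g. K = 256 for COCO) benefit performance and transferability.
SlotCon shows complementary benefits from semantic grouping and group-level contrastive learning, enabling object-centric representations to be learned from scene-centric data.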
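
For readers unfamiliar with the two unsupervised segmentation metrics quoted above, the following is a small sketch of how mean IoU (mIoU) and pixel accuracy (pAcc) are commonly computed from a confusion matrix. It assumes predicted cluster ids have already been mapped to ground-truth class ids; how that mapping is obtained is not covered here, and the function names and toy arrays are hypothetical. The values it returns are fractions, whereas the figures above appear to be percentages.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """cm[t, p] counts pixels with ground-truth class t predicted as class p."""
    idx = target.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pacc(pred, target, num_classes):
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm)                                # correctly labeled pixels per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp    # predicted + actual - overlap
    valid = union > 0                               # skip classes absent from both maps
    miou = (tp[valid] / union[valid]).mean()
    pacc = tp.sum() / cm.sum()
    return float(miou), float(pacc)

# Toy 2x3 label maps with 3 classes (illustrative data only).
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
miou, pacc = miou_and_pacc(pred, target, num_classes=3)
print(f"mIoU = {miou:.4f}, pAcc = {pacc:.4f}")  # mIoU = 0.5000, pAcc = 0.6667
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, which average to 0.5, and 4 of the 6 pixels are labeled correctly.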


This review was generated by AI and checked by a human editor.