QUICK REVIEW

[Paper Review] Mugs: A Multi-Granular Self-Supervised Learning Framework

Pan Zhou, Yichen Zhou | arXiv (Cornell University) | 2022-03-27

Domain Adaptation and Few-Shot Learning | Citations: 20

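One-Line Summary

Mugs introduces a multi-granular SSL framework with three complementary supervisions (instance, local-group, and group discrimination) that learns instance-, local-group-, and group-level features. It achieves state-of-the-art (SOTA) linear probing accuracy on ImageNet-1K and shows strong transfer ability.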

ABSTRACT

In self-supervised learning, multi-granular features are heavily desired though rarely investigated, as different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine- or coarse-grained one or their mixture. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) an instance discrimination supervision (IDS), 2) a novel local-group discrimination supervision (LGDS), and 3) a group discrimination supervision (GDS). IDS distinguishes different instances to learn instance-level fine-grained features. LGDS aggregates features of an image and its neighbors into a local-group feature, and pulls local-group features from different crops of the same image together and push them away for others. It provides complementary instance supervision to IDS via an extra alignment on local neighbors, and scatters different local-groups separately to increase discriminability. Accordingly, it helps learn high-level fine-grained features at a local-group level. Finally, to prevent similar local-groups from being scattered randomly or far away, GDS brings similar samples close and thus pulls similar local-groups together, capturing coarse-grained features at a (semantic) group level. Consequently, Mugs can capture three granular features that often enjoy higher generality on diverse downstream tasks over single-granular features, e.g., instance-level fine-grained features in contrastive learning. By only pretraining on ImageNet-1K, Mugs sets new SoTA linear probing accuracy 82.1% on ImageNet-1K and improves previous SoTA by 1.1%. It also surpasses SoTAs on other tasks, e.g. transfer learning, detection and segmentation.

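Motivation and Objectives

Argues that multi-granular representations are needed because different downstream tasks call for coarse-grained, fine-grained, or mixed-granularity features.
Proposes a self-supervised framework that explicitly learns multi-granular visual features through three complementary supervisions: instance, local-group, and group discrimination.
Demonstrates that multi-granular learning improves generality and transferability across classification, detection, segmentation, and video tasks.
Evaluates on ImageNet-1K with vision transformers and compares against state-of-the-art SSL methods under several evaluation protocols.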

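Proposed Method

Introduces three granular supervisions, the first being instance discrimination supervision (IDS), which distinguishes individual instances to learn instance-level fine-grained features.
Proposes local-group discrimination supervision (LGDS), which uses a small transformer to aggregate an image and its neighbor images into a local-group feature and aligns the local-group features of different crops of the same image.
Uses group discrimination supervision (GDS) with online-clustering prototypes to capture coarse semantic group-level features, applying a cross-entropy loss between soft pseudo-labels and group assignments.
Combines L_instance, L_local-group, and L_group into a joint objective with equal weights (1/3 each) and updates the teacher via an exponential moving average (EMA).
Trains with a multi-crop setup (two large crops and several small crops) and uses memory buffers for negative samples and local-group samples.
Evaluates on ImageNet-1K under linear probing, k-NN, fine-tuning, and semi-supervised settings, and compares with MoCo, SimCLR, BYOL, SwAV, DINO, iBOT, and others.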
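
The review states the equal 1/3 weighting, the EMA teacher update, and the soft pseudo-label cross-entropy, but not the exact loss formulas. The block below is therefore only a minimal pure-Python sketch under assumptions: the function names (`info_nce`, `soft_cross_entropy`, `mugs_objective`, `ema_update`), the InfoNCE form of the contrastive term, the temperatures, and the momentum value are all hypothetical choices for illustration, and the neighbor-aggregating transformer of LGDS is omitted.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.2):
    """Contrastive loss: the positive should outscore the buffered negatives.

    Assumed form for the instance term; inputs are expected to be L2-normalized.
    """
    logits = [dot(query, positive)] + [dot(query, n) for n in negatives]
    return -log_softmax(logits, temperature)[0]


def soft_cross_entropy(teacher_logits, student_logits,
                       t_teacher=0.04, t_student=0.1):
    """Cross-entropy between the teacher's soft pseudo-label and the student's
    assignment over cluster prototypes (assumed form for the group term)."""
    target = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    log_pred = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(target, log_pred))


def mugs_objective(l_instance, l_local_group, l_group):
    """Equal-weight combination (1/3 each), as stated in the review."""
    return (l_instance + l_local_group + l_group) / 3.0


def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters follow an exponential moving average of the student."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}
```

In a training step, one would compute the three terms for each crop pair, call `mugs_objective` on them, backpropagate through the student only, and then call `ema_update` to refresh the teacher. The same contrastive form could plausibly be applied to aggregated local-group features for the LGDS term, though that is an assumption rather than something the review specifies.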

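Experimental Results

Research Questions

RQ1: Can SSL representations simultaneously encode instance-, local-group-, and group-level semantics and thereby improve downstream task performance?
RQ2: How do the three granular supervisions yield more general and transferable features than single-granularity SSL methods?
RQ3: How does multi-granular supervision affect linear probing, k-NN, fine-tuning, and semi-supervised learning on ImageNet-1K?
RQ4: Do the learned multi-granular features transfer beyond classification, for example to detection and segmentation?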

Key Results

Method	Architecture	#Params	Dataset	Epochs	Linear probing (%)	k-NN (%)
MoCo-v3	ResNet-50	23M	ImageNet-1K	1600	74.6	—
SimCLR	ResNet-50	23M	ImageNet-1K	1600	69.3	—
InfoMin Aug	ResNet-50	23M	ImageNet-1K	1600	73.0	—
SimSiam	ResNet-50	23M	ImageNet-1K	1600	71.3	—
BYOL	ResNet-50	23M	ImageNet-1K	2000	74.3	—
SwAV	ResNet-50	23M	ImageNet-1K	2400	75.3	65.7
DeepCluster-v2	ResNet-50	23M	ImageNet-1K	2400	75.2	—
DINO	ResNet-50	23M	ImageNet-1K	3200	75.3	67.5
MoCo-v3	ViT-S/16	21M	ImageNet-1K	3200	73.4	—
SwAV	ViT-S/16	21M	ImageNet-1K	3200	73.5	66.3
DINO	ViT-S/16	21M	ImageNet-1K	3200	77.0	74.5
iBOT	ViT-S/16	21M	ImageNet-1K	3200	77.9	75.2
Mugs	ViT-S/16	21M	ImageNet-1K	3200	78.9	75.6
MoCo-v3	ViT-B/16	85M	ImageNet-1K	1200	76.7	—
DINO	ViT-B/16	85M	ImageNet-1K	1600	78.2	76.1
iBOT	ViT-B/16	85M	ImageNet-1K	1600	79.5	77.1
Mugs	ViT-B/16	85M	ImageNet-1K	1600	80.6	78.0
MoCo-v3	ViT-L/16	307M	ImageNet-1K	1200	77.6	—
iBOT	ViT-L/16	307M	ImageNet-1K	1000	81.0	78.0
Mugs	ViT-L/16	307M	ImageNet-1K	1000	82.1	80.3

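Mugs achieves state-of-the-art (SOTA) linear probing accuracy on ImageNet-1K: 82.1% with ViT-L/16 pretrained on ImageNet-1K only.
Across model sizes (ViT-S/16, ViT-B/16, ViT-L/16) and pretraining epochs, Mugs improves linear probing accuracy over the previous SOTA by at least 0.8 percentage points.
Under k-NN evaluation, Mugs attains the highest accuracy on every backbone, and the margin grows to as much as 2.3 percentage points (ViT-L/16: 80.3 vs. 78.0 for iBOT).
In fine-tuning and semi-supervised settings, Mugs sets a new SOTA with ViT-S/16 and ViT-B/16 and performs strongly with limited labeled data (e.g., 1% or 10% of the labels).
Mugs transfers well to downstream tasks such as detection and segmentation, which suggests that the learned multi-granular features generalize.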
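
As a quick arithmetic check of the margins, the ViT rows of the results table above can be compared directly. This is only a sketch: the row values are copied from that table, while the `ROWS` layout and the `margins` helper are hypothetical names introduced here. It covers just the epoch settings listed in the table, so the 0.8-point lower bound, which refers to other pretraining epochs as well, cannot be verified from these rows alone.

```python
# Rows copied from the results table: (method, backbone, linear probing %, k-NN %).
# None marks a value the table does not report.
ROWS = [
    ("MoCo-v3", "ViT-S/16", 73.4, None),
    ("SwAV", "ViT-S/16", 73.5, 66.3),
    ("DINO", "ViT-S/16", 77.0, 74.5),
    ("iBOT", "ViT-S/16", 77.9, 75.2),
    ("Mugs", "ViT-S/16", 78.9, 75.6),
    ("MoCo-v3", "ViT-B/16", 76.7, None),
    ("DINO", "ViT-B/16", 78.2, 76.1),
    ("iBOT", "ViT-B/16", 79.5, 77.1),
    ("Mugs", "ViT-B/16", 80.6, 78.0),
    ("MoCo-v3", "ViT-L/16", 77.6, None),
    ("iBOT", "ViT-L/16", 81.0, 78.0),
    ("Mugs", "ViT-L/16", 82.1, 80.3),
]


def margins(rows, column):
    """Mugs minus the best other method per backbone.

    column is 2 for linear probing or 3 for k-NN; unreported values are skipped.
    """
    result = {}
    for backbone in dict.fromkeys(r[1] for r in rows):
        group = [r for r in rows if r[1] == backbone and r[column] is not None]
        mugs = next(r[column] for r in group if r[0] == "Mugs")
        best_other = max(r[column] for r in group if r[0] != "Mugs")
        result[backbone] = round(mugs - best_other, 1)
    return result


linear_margins = margins(ROWS, 2)
knn_margins = margins(ROWS, 3)
print(linear_margins)  # {'ViT-S/16': 1.0, 'ViT-B/16': 1.1, 'ViT-L/16': 1.1}
print(knn_margins)     # {'ViT-S/16': 0.4, 'ViT-B/16': 0.9, 'ViT-L/16': 2.3}
```

The largest k-NN gap (2.3 points on ViT-L/16) and the 1.1-point linear probing gain on ViT-L/16 both match the figures quoted in the review and the abstract.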


This review was generated by AI and checked by a human editor.