QUICK REVIEW

[Paper Review] What Do Self-Supervised Vision Transformers Learn?

Namuk Park, Wonjae Kim|arXiv (Cornell University)|2023. 05. 01.

Domain Adaptation and Few-Shot Learning | Citations: 16

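One-Line Summary

This paper compares contrastive learning (CL) and masked image modeling (MIM) for self-supervised Vision Transformers, showing that CL captures global shape while MIM captures local texture, and that a simple CL+MIM hybrid outperforms either method alone.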

ABSTRACT

We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance of downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, but MIM utilizes high-frequencies. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Upon these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.

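Research Motivation and Goals

Understand how self-supervised ViTs trained with CL and MIM differ in their learned representations and downstream performance.
Investigate how self-attention, representation transformations, and the roles of individual layers differ between CL and MIM.
Analyze whether CL and MIM can complement each other to improve linear probing and fine-tuning results.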

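Proposed Method

Compare ViT-B/16 models trained with MoCo (CL) and SimMIM (MIM) on ImageNet-1K.
Analyze self-attention behavior, effective receptive fields, and attention diversity across layers.
Use linear probing, fine-tuning, mutual information, cosine similarity, and singular value spectra to characterize the representations.
Perform Fourier analysis to study frequency bias (low versus high frequency) in the representations.
Evaluate robustness to texture changes using Stylized ImageNet, along with robustness to high-frequency noise.
Explore a simple linear combination of the CL and MIM objectives as a hybrid training method.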
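
The Fourier analysis step above can be illustrated with a small sketch. It measures what fraction of a 2D feature map's spectral energy lies above a radial frequency cutoff. This is a simplified proxy for frequency bias, not the exact metric used in the paper; the function name `high_frequency_ratio` and the `cutoff` parameter are assumptions introduced here for illustration.

```python
import numpy as np


def high_frequency_ratio(feature_map: np.ndarray, cutoff: float = 0.5) -> float:
    """Fraction of spectral energy above a radial frequency cutoff.

    feature_map: (H, W) array, e.g. one channel of ViT token features
        reshaped onto the patch grid (assumed layout).
    cutoff: fraction of the Nyquist frequency (hypothetical parameter).
    """
    # 2D FFT, shifted so the zero-frequency (DC) term sits at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    energy = np.abs(spectrum) ** 2

    h, w = feature_map.shape
    # Frequencies in cycles per sample, shifted to match the spectrum layout.
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    # The Nyquist frequency is 0.5 cycles per sample.
    high = radius > cutoff * 0.5
    total = energy.sum()
    return float(energy[high].sum() / total) if total > 0 else 0.0
```

Under this proxy, a higher ratio means more of the feature map's energy sits in fine-grained, texture-like variation, while a ratio near zero indicates smooth, shape-like structure. Comparing the ratio layer by layer for two models would be one rough way to contrast their frequency bias.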

Figure 1: Self-attentions of CL (MoCo) capture global information, but they collapse into homogeneous attention maps for all query tokens and heads. Self-attentions of MIM (SimMIM) mainly focus on local areas and similar tokens. We visualize the attention maps for two different query tokens in the b

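Experimental Results

Research Questions

RQ1: How does self-attention differ between CL and MIM in terms of global and local relationships?
RQ2: How are token and image representations transformed by CL and MIM across the depth of the ViT?
RQ3: Which layers and components have the greatest influence on the representations learned by CL and MIM?
RQ4: Can CL and MIM be combined effectively to exploit their complementary strengths?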

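Key Results

CL captures global relationships and object shape, but its self-attention collapses into homogeneous maps in the later layers.
MIM captures local relationships and texture, preserves token-level diversity, and avoids attention collapse.
CL relies on low-frequency information and MIM on high-frequency information, suggesting that CL is shape-biased and MIM is texture-biased.
The later layers are especially important for CL, whereas the early layers matter more for MIM.
A simple linear combination of the CL and MIM objectives yields better linear probing and fine-tuning performance than either method alone.
The hybrid model exhibits predominantly CL-like properties in the later layers and MIM-like properties in the early layers.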
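
The linear combination of objectives mentioned in the results above might look like the following minimal sketch. It assumes an InfoNCE-style contrastive loss (as used by MoCo) and an L1 reconstruction loss over masked patches (as used by SimMIM), mixed by a weight `lam`. The function names, the convex form of the mix, and the toy temperature value are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np


def info_nce(queries: np.ndarray, keys: np.ndarray, temperature: float = 0.2) -> float:
    """Contrastive (InfoNCE) loss; row i of `keys` is the positive for row i of `queries`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def masked_l1(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error over masked patches only.

    pred, target: (num_patches, patch_dim); mask: (num_patches,), True where hidden.
    """
    per_patch = np.abs(pred - target).mean(axis=1)
    m = mask.astype(float)
    return float((per_patch * m).sum() / max(m.sum(), 1.0))


def hybrid_loss(loss_cl: float, loss_mim: float, lam: float) -> float:
    """Convex mix of the two objectives; `lam` is the (assumed) weight on CL."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_mim + lam * loss_cl
```

With `lam = 0` this reduces to pure MIM training and with `lam = 1` to pure CL, so sweeping `lam` is one way to explore the trade-off between the two behaviors described above.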


This review was generated by AI and reviewed by a human editor.