QUICK REVIEW

[Paper Review] What Do Self-Supervised Vision Transformers Learn?

Namuk Park, Wonjae Kim|arXiv (Cornell University)|May 1, 2023

Domain Adaptation and Few-Shot Learning | Cited by 16

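One-Sentence Summary

The paper compares contrastive learning (CL) and masked image modeling (MIM) in self-supervised Vision Transformers, shows that CL captures global shapes while MIM captures local textures, and demonstrates that a simple CL+MIM hybrid outperforms either method used alone.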

ABSTRACT

We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance of downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, but MIM utilizes high-frequencies. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Upon these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.
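The two attention properties named in the abstract, range (global vs. local) and homogeneity across query tokens, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the attention matrix is taken to cover patch tokens only (no class token) in row-major grid order, and mean pairwise cosine similarity is a simple stand-in for whatever homogeneity measure the authors actually use.

```python
import math


def mean_attention_distance(attn, grid_w, patch_size=16):
    """Average distance (in pixels) between each query and the keys it attends to.

    attn[q][k] is the attention weight from query token q to key token k;
    each row is assumed to sum to 1. Tokens lie on a grid with grid_w
    columns in row-major order (assumption: no class token).
    """
    n = len(attn)
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_w)
        for k in range(n):
            ky, kx = divmod(k, grid_w)
            total += attn[q][k] * math.hypot(qy - ky, qx - kx) * patch_size
    return total / n


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_homogeneity(attn):
    """Mean pairwise cosine similarity between the attention rows of different queries.

    A value near 1 means every query token produces nearly the same
    attention map, i.e. the maps have collapsed into homogeneity.
    """
    n = len(attn)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(_cosine(attn[i], attn[j]) for i, j in pairs) / len(pairs)
```

On a toy 2x2 grid, a uniform attention matrix (every query attends equally to all tokens) gives a large distance and a homogeneity of 1, which is the CL-like extreme; an identity matrix (every query attends only to itself) gives zero for both, which is an exaggerated MIM-like extreme.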

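Motivation and Goals

Understand how ViTs trained with CL and with MIM differ in their learned representations and downstream performance.
Study how self-attention, representation transformations, and the role of individual layers differ between CL and MIM.
Analyze whether CL and MIM can complement each other to improve linear probing and fine-tuning results.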

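Proposed Method

Compare ViT-B/16 models trained with MoCo (CL) and SimMIM (MIM), using ImageNet-1K as the benchmark dataset.
Analyze self-attention behavior, effective receptive fields, and attention diversity at each layer.
Characterize representations using linear probing, fine-tuning, mutual information, cosine similarity, and singular value spectra.
Apply Fourier analysis to study frequency bias (low vs. high frequency) in the representations.
Evaluate robustness to texture changes using Stylized ImageNet, as well as robustness to high-frequency noise.
Explore a simple linear combination of the CL and MIM objectives as a hybrid training method.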
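The Fourier analysis step above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the `cutoff` threshold, and the choice of "fraction of spectral amplitude above a cutoff" as the summary statistic are all assumptions. The naive DFT is written out with the standard library so the block is self-contained; in practice an FFT routine such as `numpy.fft.fft2` would replace `dft2`.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a small real-valued grid (list of rows)."""
    h, w = len(x), len(x[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for row in range(h):
                for col in range(w):
                    angle = -2 * math.pi * (u * row / h + v * col / w)
                    s += x[row][col] * cmath.exp(1j * angle)
            out[u][v] = s
    return out


def high_frequency_fraction(fmap, cutoff=0.25):
    """Fraction of total spectral amplitude at normalized frequencies >= cutoff.

    fmap is one channel of a token feature map arranged on its spatial grid.
    Normalized frequency runs from 0 (DC) to 0.5 (Nyquist); cutoff is a
    hypothetical threshold chosen only for this illustration.
    """
    h, w = len(fmap), len(fmap[0])
    spec = dft2(fmap)
    high = 0.0
    total = 0.0
    for u in range(h):
        for v in range(w):
            fu = min(u, h - u) / h
            fv = min(v, w - v) / w
            amp = abs(spec[u][v])
            total += amp
            if max(fu, fv) >= cutoff:
                high += amp
    return high / total if total else 0.0
```

A spatially constant map puts all of its amplitude at the zero frequency, so the fraction is close to 0; a checkerboard map puts all of it at the Nyquist frequency, so the fraction is close to 1. Under the summary's claim, MIM representations would be expected to score higher on a statistic of this kind than CL representations.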

Figure 1: Self-attentions of CL (MoCo) capture global information, but they collapse into homogeneous attention maps for all query tokens and heads. Self-attentions of MIM (SimMIM) mainly focus on local areas and similar tokens. We visualize the attention maps for two different query tokens in the b

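Experimental Results

Research Questions

RQ1: How do the self-attentions of CL and MIM differ in terms of global versus local relationships?
RQ2: How do CL and MIM transform token and image representations across the depth of the ViT?
RQ3: Which layers and components have the greatest influence on the learned representations in CL versus MIM?
RQ4: Can CL and MIM be combined effectively to exploit their complementary strengths?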

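Main Findings

CL captures global relationships and object shapes, but its self-attentions collapse into homogeneous maps in the later layers.
MIM captures local relationships and textures, preserves token-level diversity, and avoids attention collapse.
CL relies on low-frequency information while MIM relies on high-frequency information, indicating a shape bias for CL and a texture bias for MIM.
The later layers are especially important for CL, while the early layers matter more for MIM.
A simple linear combination of the CL and MIM objectives achieves better linear probing and fine-tuning performance than either method used alone.
In the hybrid model, CL-like properties dominate in the later layers and MIM-like properties dominate in the early layers.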
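The "simple linear combination" of objectives can be sketched as a convex combination of the two loss values. The exact weighting scheme is not stated in this review, so the form below, the function name `hybrid_loss`, and the weight `lam` are assumptions for illustration only.

```python
def hybrid_loss(cl_loss, mim_loss, lam):
    """Convex combination of a contrastive loss and a masked-image-modeling loss.

    lam is a hypothetical mixing weight in [0, 1]: lam = 0 recovers pure MIM
    training and lam = 1 recovers pure CL training. The weighting form is an
    assumption, not taken from the reviewed paper's text.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * cl_loss + (1.0 - lam) * mim_loss
```

In a training loop, `cl_loss` and `mim_loss` would be the per-batch scalar losses computed from the same backbone, and the combined value would be the one backpropagated; a single scalar weight keeps the hybrid as simple as the finding above describes.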


This review was generated by AI and checked by a human editor.