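[Paper Explained] SiT: Self-supervised vIsion Transformer
SiT introduces Group Masked Model Learning (GMML) for self-supervised pretraining of Vision Transformers. It combines masked-token reconstruction with contrastive learning, outperforms supervised pretraining on small and medium-sized datasets, and is competitive on large-scale data.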
Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP) self-supervised learning and transformers are already the methods of choice. The recent literature suggests that the transformers are becoming increasingly popular also in computer vision. So far, the vision transformers have been shown to work well when pretrained either using a large scale supervised data or with some kind of co-supervision, e.g. in terms of teacher network. These supervised pretrained vision transformers achieve very good results in downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets, consisting of a few thousand images rather than several millions. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of the transformers and their suitability for self-supervised learning. We outperformed existing self-supervised learning methods by large margin. We also observed that SiT is good for few shot learning and also showed that it is learning useful representation by simply training a linear classifier on top of the learned features from SiT. Pretraining, finetuning, and evaluation codes will be available under: https://github.com/Sara-Ahmed/SiT.
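Research Motivation and Objectives
- Reduce the reliance of Vision Transformers on labeled data through self-supervised learning (SSL).
- Propose GMML so that ViTs can learn local inductive biases from limited data.
- Develop a transformer-based autoencoder that supports multi-task self-supervised objectives (reconstruction and contrastive learning).
- Demonstrate that SSL with SiT can outperform supervised pretraining across multiple datasets and transfer settings.
Proposed Method
- Use a Vision Transformer (ViT) backbone with a lightweight decoder, forming a transformer autoencoder.
- Apply Group Masked Model Learning (GMML): mask groups of tokens and reconstruct the local image content.
- Jointly optimize a reconstruction loss (L_recons) and a contrastive loss (L_contr) computed between augmented views.
- Use a momentum encoder for contrastive learning to improve the stability of the representations.
- Train end to end with L_total = alpha * L_recons + L_contr, where alpha is tuned separately for small and large datasets.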
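The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`group_mask`, `reconstruction_loss`, `contrastive_loss`, `momentum_update`, `total_loss`), the rectangular block shape, the 50% mask ratio, the L1 reconstruction error restricted to masked tokens, the InfoNCE form of the contrastive term, the temperature, and the momentum coefficient are all hypothetical choices made for the example. Only the combination L_total = alpha * L_recons + L_contr comes from the summary itself.

```python
import numpy as np


def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=3, rng=None):
    """Boolean (grid_h, grid_w) mask; True marks a masked patch token.

    Rectangular groups of neighbouring tokens are masked until at least
    `mask_ratio` of the grid is covered. Block shape and ratio are
    assumptions; max_block must not exceed the grid size.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(np.ceil(mask_ratio * grid_h * grid_w))
    while mask.sum() < target:
        bh = int(rng.integers(1, max_block + 1))  # block height
        bw = int(rng.integers(1, max_block + 1))  # block width
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask


def reconstruction_loss(pred, target, mask):
    """Mean absolute error over masked tokens only (assumed choice).

    pred, target: (num_tokens, dim) arrays; mask: boolean token grid.
    """
    m = mask.reshape(-1)
    return float(np.abs(pred[m] - target[m]).mean())


def contrastive_loss(z1, z2, temperature=0.2):
    """InfoNCE between two views; row i of z1 is the positive of row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def momentum_update(online_params, momentum_params, m=0.99):
    """Exponential moving average of the online weights (momentum encoder)."""
    return [m * t + (1.0 - m) * o for o, t in zip(online_params, momentum_params)]


def total_loss(l_recons, l_contr, alpha=1.0):
    """L_total = alpha * L_recons + L_contr, as stated in the summary."""
    return alpha * l_recons + l_contr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = group_mask(14, 14, rng=rng)        # 14 x 14 token grid (assumed)
    target = rng.normal(size=(196, 8))        # toy patch targets
    pred = target + 0.1 * rng.normal(size=target.shape)
    z1, z2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
    loss = total_loss(reconstruction_loss(pred, target, mask),
                      contrastive_loss(z1, z2), alpha=2.0)
    print(f"masked fraction={mask.mean():.2f}  total loss={loss:.3f}")
```

In a real training loop the predictions would come from the ViT encoder and lightweight decoder, and the second view's embeddings from the momentum encoder; here random arrays stand in for both.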
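Experimental Results
Research Questions
- RQ1: Can GMML enable Vision Transformers to learn effective representations from unlabeled data under limited supervision?
- RQ2: On small and medium-sized datasets, does self-supervised pretraining with SiT outperform supervised pretraining?
- RQ3: How does SiT perform under domain transfer and downstream fine-tuning?
- RQ4: What effect does combining the reconstruction and contrastive objectives have on ViTs?
- RQ5: Is a lightweight Transformer decoder sufficient for effective SSL with ViTs?
Key Findings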
| Method | Flowers | Pets | CUB | Aircraft | STL10 | Cars | CIFAR10 | CIFAR100 |
|---|---|---|---|---|---|---|---|---|
| Random init. | 68.8 | 47.5 | 25.3 | 31.1 | 77.1 | 27.4 | 96.9 | 77.8 |
| MoCo-v3 [72] | 88.9 | 69.0 | 53.1 | 62.5 | 95.4 | 84.0 | 97.3 | 83.4 |
| Dino [73] | 82.4 | 58.0 | 43.6 | 49.3 | 92.1 | 73.0 | 96.8 | 78.9 |
| MAE [57] | 86.9 | 73.0 | 59.4 | 69.0 | – | 91.0 | – | – |
| SiT | 92.8 | 84.7 | 71.2 | 77.8 | 96.5 | 92.1 | 98.2 | 85.2 |
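- Without external data for pretraining, SiT consistently outperforms supervised pretraining and prior self-supervised methods on several small and medium-sized datasets.
- On small datasets the gains are large: in the table above, SiT reaches 84.7 on Pets and 71.2 on CUB, against 73.0 and 59.4 for the strongest baseline (MAE).
- When pretrained on larger data, SiT matches or even exceeds recent self-supervised methods that use larger models or more data.
- GMML lets ViTs learn local inductive biases from partial token corruption, which improves generalization on downstream tasks.
- SiT shows strong domain-transfer ability and remains competitive when fine-tuned on the target dataset.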
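To make the size of these gains concrete, the short script below computes, for each dataset, SiT's margin over the best baseline reported in the table above. The numbers are copied from the table, with `None` marking entries the table leaves blank; the names `DATASETS`, `BASELINES`, `SIT` and `margins_over_best_baseline` are introduced here purely for illustration.

```python
DATASETS = ["Flowers", "Pets", "CUB", "Aircraft",
            "STL10", "Cars", "CIFAR10", "CIFAR100"]

# Scores copied from the table above; None marks entries not reported there.
BASELINES = {
    "Random init.": [68.8, 47.5, 25.3, 31.1, 77.1, 27.4, 96.9, 77.8],
    "MoCo-v3": [88.9, 69.0, 53.1, 62.5, 95.4, 84.0, 97.3, 83.4],
    "Dino": [82.4, 58.0, 43.6, 49.3, 92.1, 73.0, 96.8, 78.9],
    "MAE": [86.9, 73.0, 59.4, 69.0, None, 91.0, None, None],
}
SIT = [92.8, 84.7, 71.2, 77.8, 96.5, 92.1, 98.2, 85.2]


def margins_over_best_baseline(baselines, sit_scores, datasets):
    """Map each dataset to (best baseline name, SiT score minus that baseline)."""
    result = {}
    for i, name in enumerate(datasets):
        # Keep only the baselines that report a score for this dataset.
        reported = {method: row[i] for method, row in baselines.items()
                    if row[i] is not None}
        best = max(reported, key=reported.get)
        result[name] = (best, round(sit_scores[i] - reported[best], 1))
    return result


if __name__ == "__main__":
    margins = margins_over_best_baseline(BASELINES, SIT, DATASETS)
    for name, (best, gap) in margins.items():
        print(f"{name:9s} best baseline: {best:8s} SiT margin: +{gap}")
```

The margins are widest on the fine-grained datasets (Pets, CUB, Aircraft) and narrowest where the baselines are already close to saturation (CIFAR10, STL10, Cars).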
This explainer was generated by AI and reviewed by human editors.