QUICK REVIEW

[Paper Review] Semi-supervised Vision Transformers at Scale

Zhaowei Cai, Avinash Ravichandran|arXiv (Cornell University)|Aug 11, 2022

Advanced Neural Network Applications | Cited by 21

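One-Sentence Summary

Semi-ViT introduces an EMA-Teacher-based SSL pipeline and probabilistic pseudo mixup for vision transformers, achieving state-of-the-art SSL results on ImageNet with very few labels and scaling well across model sizes.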

ABSTRACT

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% labels, which is comparable with Inception-v4 using 100% ImageNet labels.

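Motivation and Objectives

Demonstrate how semi-supervised learning (SSL) performs for Vision Transformers (ViT) at different scales.
Propose a stable SSL pipeline: un/self-supervised pre-training, then supervised fine-tuning, then semi-supervised fine-tuning.
Address the instability of ViTs under FixMatch by adopting EMA-Teacher with confidence-based filtering.
Introduce probabilistic pseudo mixup to regularize training on unlabeled data and make better use of noisy pseudo labels.
Show the scalability advantage of ViTs in SSL and quantify the label-efficiency gains across datasets.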

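Proposed Method

Adopt a three-stage SSL pipeline: optional un/self-supervised pre-training on all data, then supervised fine-tuning on the labeled data, then semi-supervised fine-tuning on all data.
Replace FixMatch with EMA-Teacher to stabilize SSL fine-tuning of ViTs (the teacher is updated as an exponential moving average of the student).
Generate pseudo labels with the teacher on weakly augmented unlabeled images; when the confidence exceeds a threshold, use them to supervise the student on strongly augmented versions of the same images.
Introduce probabilistic pseudo mixup, in which the mixing ratio is determined by sample confidence, enabling weighted interpolation of unlabeled samples and their pseudo labels.
Combine the cross-entropy loss on labeled data with a confidence-gated, masked loss on unlabeled data to mitigate noisy pseudo labels.
Demonstrate scalability by evaluating ViT-Small through ViT-Huge, comparing against CNN-based SSL baselines and the fully supervised upper bound.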
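
The semi-supervised fine-tuning steps listed above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: model weights are plain lists of floats, the function names, the momentum and threshold defaults, and the loss weight `mu` are all hypothetical, and the mixing ratio `c_i / (c_i + c_j)` is only one assumed way of letting confidence set the ratio, since this summary does not state the exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight toward the student weight (EMA).

    The momentum default is an illustrative value, not taken from the paper.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def pseudo_label(teacher_logits, threshold=0.5):
    """Teacher prediction on a weakly augmented image.

    Returns (class index, confidence, mask); the mask is 1.0 only when
    the confidence reaches the (hypothetical) threshold.
    """
    probs = softmax(teacher_logits)
    confidence = max(probs)
    mask = 1.0 if confidence >= threshold else 0.0
    return probs.index(confidence), confidence, mask


def cross_entropy(student_logits, target):
    """Cross-entropy of the student's prediction against a soft target."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


def confidence_mix(x_i, y_i, c_i, x_j, y_j, c_j):
    """Interpolate two unlabeled samples and their pseudo labels.

    ASSUMPTION: lam = c_i / (c_i + c_j), so the more confident sample
    receives the larger weight. The exact formula is not in this summary.
    """
    lam = c_i / (c_i + c_j)

    def mix(a, b):
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    return mix(x_i, x_j), mix(y_i, y_j), lam


def total_loss(labeled_loss, unlabeled_losses, masks, mu=1.0):
    """Labeled loss plus mu times the mean confidence-masked unlabeled loss."""
    masked = [m * l for m, l in zip(masks, unlabeled_losses)]
    return labeled_loss + mu * sum(masked) / len(masked)


# Toy usage: the teacher labels a weak view, the student is scored on a strong view.
label, conf, mask = pseudo_label([2.0, 0.0, 0.0])
target = [1.0 if k == label else 0.0 for k in range(3)]
loss_u = cross_entropy([1.0, 0.5, 0.0], target)
print(total_loss(0.7, [loss_u], [mask]))
```

In this sketch, an unlabeled sample whose teacher confidence falls below the threshold contributes zero to the loss, which is the gating effect described in the list above.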

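Experimental Results

Research Questions

RQ1: With a carefully designed SSL pipeline, can pure Vision Transformers reach SSL performance competitive with CNNs?
RQ2: Does EMA-Teacher improve the stability and accuracy of ViT-based SSL compared with FixMatch?
RQ3: How does probabilistic pseudo mixup affect the regularization and performance of ViT SSL under different label regimes?
RQ4: To what extent can Semi-ViT maintain or improve SSL performance as the model size grows?
RQ5: How large are the label-efficiency gains from Semi-ViT on ImageNet and other datasets?

Key Findings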

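ImageNet top-1 accuracy (%) by fraction of labels used:

Model	Params	Method	1% labels	10% labels	100% labels
ViT-Base	86M	finetune	57.4	73.7	83.7
ViT-Base	86M	Semi-ViT	71.0	79.7	-
ViT-Large	307M	finetune	67.1	79.2	86.0
ViT-Large	307M	Semi-ViT	77.3	83.3	-
ViT-Huge	632M	finetune	71.5	81.4	86.9
ViT-Huge	632M	Semi-ViT	80.0	84.3	-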
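
As a quick worked check on the table, the sketch below computes how many points Semi-ViT adds over plain fine-tuning at each model size. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Top-1 accuracy (%) from the table: (1% labels, 10% labels).
finetune = {
    "ViT-Base": (57.4, 73.7),
    "ViT-Large": (67.1, 79.2),
    "ViT-Huge": (71.5, 81.4),
}
semi_vit = {
    "ViT-Base": (71.0, 79.7),
    "ViT-Large": (77.3, 83.3),
    "ViT-Huge": (80.0, 84.3),
}


def gains(baseline, improved):
    """Per-model accuracy gain in points, rounded to one decimal place."""
    return {
        model: tuple(round(b - a, 1)
                     for a, b in zip(baseline[model], improved[model]))
        for model in baseline
    }


for model, (g1, g10) in gains(finetune, semi_vit).items():
    print(f"{model}: +{g1} points at 1% labels, +{g10} points at 10% labels")
```

The gains come out to 13.6 and 6.0 points for ViT-Base, 10.2 and 4.1 for ViT-Large, and 8.5 and 2.9 for ViT-Huge, so in this table the benefit is largest when labels are scarcest and shrinks as the fine-tuning baseline gets stronger.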

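Semi-ViT achieves SSL performance competitive with or better than its CNN counterparts at every ViT scale.
EMA-Teacher outperforms FixMatch for ViT SSL, with more stable training and higher accuracy.
Without heavy pre-training, probabilistic pseudo mixup gives consistent gains over standard pseudo mixup and pseudo mixup+.
Self-supervised pre-training (e.g., MAE) substantially improves SSL results, giving strong performance even with only 1% of labels.
Semi-ViT-Huge reaches 80.0% top-1 on ImageNet with 1% of labels and 84.3% with 10%, approaching the fully supervised upper bound with far fewer annotations.
Semi-ViT transfers well to other datasets (Food-101, iNaturalist, GoogleLandmark), gaining 13-21 points with 1% of labels and 7-10 points with 10%.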
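
To make "approaching the fully supervised upper bound" concrete, the following sketch computes the remaining gap between Semi-ViT and 100%-label fine-tuning, using the ImageNet accuracies reported in the results table of this section. The names are illustrative, and treating the 100%-label fine-tuning result as the upper bound is an assumption of this example.

```python
# ImageNet top-1 accuracy (%) from the results table in this section.
fully_supervised = {"ViT-Base": 83.7, "ViT-Large": 86.0, "ViT-Huge": 86.9}
semi_vit = {
    "ViT-Base": {"1%": 71.0, "10%": 79.7},
    "ViT-Large": {"1%": 77.3, "10%": 83.3},
    "ViT-Huge": {"1%": 80.0, "10%": 84.3},
}


def remaining_gap(upper, semi):
    """Points still missing relative to 100%-label fine-tuning."""
    return {
        model: {frac: round(upper[model] - acc, 1)
                for frac, acc in semi[model].items()}
        for model in upper
    }


for model, gap in remaining_gap(fully_supervised, semi_vit).items():
    print(f"{model}: {gap['1%']} points behind at 1% labels, "
          f"{gap['10%']} points behind at 10% labels")
```

For ViT-Huge the gap is 6.9 points with 1% of labels and 2.6 points with 10%, and the gap narrows from ViT-Base to ViT-Huge at both label fractions.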

This review was generated by AI and reviewed by human editors.