QUICK REVIEW

[Paper Review] Better plain ViT baselines for ImageNet-1k

Lucas Beyer, Xiaohua Zhai|arXiv (Cornell University)|May 3, 2022

Advanced Neural Network Applications | Cited by 49

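One-sentence summary

With a few small, non-novel tweaks, a plain ViT baseline reaches competitive performance on ImageNet-1k: 76.5% top-1 at 90 epochs and 80.0% at 300 epochs, comparable to a ResNet-50 trained under similar conditions.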

ABSTRACT

It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.

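Motivation and goals

Show that a plain ViT can achieve strong ImageNet-1k performance with only minimal, standard training adjustments.
Identify which small modifications contribute most to the improvement over the baseline ViT.
Provide a simple, easily reproducible baseline that is comparable to ResNet-50 under a similar compute budget.
Encourage the use of a straightforward ViT setup as a strong reference point for future work.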

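Proposed method

Use ViT-S/16, keeping the original ViT architecture and standard data augmentation.
Train on 99% of the ImageNet-1k training set and hold out the remaining 1% as minival, so that design choices are not tuned on the validation set, which serves as the de facto test set.
Use fixed 2D sin-cos position embeddings, and global average pooling (GAP) instead of a class token.
Use RandAugment and Mixup at moderate strength (RandAugment setting (2, 10); Mixup p = 0.2).
Set the batch size to 1024 (rather than 4096), and train for 90, 150, and 300 epochs to measure both learning speed and final accuracy.
Keep the training recipe simple: no additional regularization, distillation, or architecture changes.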
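To make the fixed 2D sin-cos position embedding and global average pooling concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` default of 10,000, and the 224x224 input resolution (giving a 14x14 patch grid for ViT-S/16 with width 384) are choices made for this example.

```python
import math


def posemb_sincos_2d(h, w, width, temperature=10_000.0):
    """Return an (h*w) x width table of fixed 2D sin-cos position embeddings."""
    if width % 4 != 0:
        raise ValueError("width must be divisible by 4")
    quarter = width // 4
    # Geometric ladder of frequencies from 1 down to 1/temperature (assumed form).
    omega = [1.0 / temperature ** (i / max(quarter - 1, 1)) for i in range(quarter)]
    table = []
    for y in range(h):
        for x in range(w):
            # A quarter of the channels each for sin(x), cos(x), sin(y), cos(y).
            row = ([math.sin(x * o) for o in omega]
                   + [math.cos(x * o) for o in omega]
                   + [math.sin(y * o) for o in omega]
                   + [math.cos(y * o) for o in omega])
            table.append(row)
    return table


def global_avg_pool(tokens):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


# Assumed setup: ViT-S/16 on 224x224 inputs -> 14x14 patches, width 384.
pe = posemb_sincos_2d(14, 14, 384)
pooled = global_avg_pool(pe)  # stands in for pooling the encoder's output tokens
print(len(pe), len(pe[0]), len(pooled))
```

Because the table depends only on the grid size and width, it has no trainable parameters and can be computed once and added to the patch embeddings. With GAP, no extra class token is prepended, so the sequence length stays at the number of patches.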

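Experimental results

Research questions

RQ1: How well does a plain ViT baseline perform on ImageNet-1k with minimal, standard augmentation?
RQ2: How do the small changes (position embedding, pooling, batch size, and mild augmentation) affect accuracy at 90, 150, and 300 training epochs?
RQ3: Under a comparable compute budget, can the plain ViT baseline match the classic ResNet-50?
RQ4: How large is the relative effect of each small change on final top-1 accuracy?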

Key findings

Condition (top-1 accuracy, %)	90ep	150ep	300ep
Our improvements	76.5	78.5	80.0
no RandAug+MixUp	73.6	73.7	73.7
Posemb: sincos2d → learned	75.0	78.0	79.6
Batch-size: 1024 → 4096	74.7	77.3	78.6
Global Avgpool → [cls] token	75.0	76.9	78.2
Head: MLP → linear	76.7	78.6	79.8
Original + RandAug + MixUp	71.6	74.8	76.1
Original	66.8	67.2	67.1
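As a rough way to read the table, the sketch below computes how far each row falls from the "Our improvements" row at every training length, then ranks the rows by their 300-epoch gap. The numbers are copied from the table above; the variable names are illustrative only.

```python
# Top-1 accuracy (%) at 90, 150, and 300 epochs, copied from the table.
results = {
    "Our improvements": (76.5, 78.5, 80.0),
    "no RandAug+MixUp": (73.6, 73.7, 73.7),
    "Posemb: sincos2d -> learned": (75.0, 78.0, 79.6),
    "Batch-size: 1024 -> 4096": (74.7, 77.3, 78.6),
    "Global Avgpool -> [cls] token": (75.0, 76.9, 78.2),
    "Head: MLP -> linear": (76.7, 78.6, 79.8),
    "Original + RandAug + MixUp": (71.6, 74.8, 76.1),
    "Original": (66.8, 67.2, 67.1),
}

reference = results["Our improvements"]

# Difference of each row from the reference row, in percentage points.
deltas = {
    name: tuple(round(acc - ref, 1) for acc, ref in zip(accs, reference))
    for name, accs in results.items()
    if name != "Our improvements"
}

# Rank by the gap at 300 epochs, largest drop first.
ranked = sorted(deltas.items(), key=lambda item: item[1][2])
for name, (d90, d150, d300) in ranked:
    print(f"{name:32s} {d90:+6.1f} {d150:+6.1f} {d300:+6.1f}")
```

Read this way, dropping RandAugment and MixUp is the costliest single ablation at 300 epochs (6.3 points), followed by the class token (1.8) and the larger batch size (1.4). Switching to learned position embeddings or a linear head costs less than half a point.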

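A plain ViT setup reaches 76.5% top-1 at 90 epochs and 80.0% top-1 at 300 epochs.
Taken together, the proposed small changes give a large improvement over the original ViT baseline (67.1% to 80.0% at 300 epochs).
In this setting, global average pooling outperforms the class token (80.0% vs. 78.2% at 300 epochs), and fixed 2D sin-cos position embeddings outperform learned ones (80.0% vs. 79.6%).
Mild RandAugment and MixUp contribute a substantial gain: adding them to the original setup raises 300-epoch accuracy from 67.1% to 76.1%, and removing them from the improved setup lowers it from 80.0% to 73.7%.
The 90-epoch run finishes in about 6h30 on a TPUv3-8, reaching roughly ResNet-50-level accuracy within a similar compute budget.
Training for 150 epochs yields 78.5% top-1, and 300 epochs yields 80.0% top-1 (as reported in the paper).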
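For a sense of scale, the following back-of-the-envelope sketch converts epochs into optimizer steps and estimates average throughput for the 90-epoch run. It rests on assumptions that are not stated in this review: an ImageNet-1k training set of 1,281,167 images, a 99% split rounded down, and the full 6h30 wall-clock time being spent on training.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1k training set
TRAIN_IMAGES = IMAGENET_TRAIN_IMAGES * 99 // 100  # 99% split, rounded down (assumed)


def total_steps(epochs, batch_size):
    """Number of full optimizer steps for a given epoch count and batch size."""
    return epochs * TRAIN_IMAGES // batch_size


for epochs in (90, 150, 300):
    print(epochs, total_steps(epochs, 1024), total_steps(epochs, 4096))

# Average throughput if the 90-epoch run takes about 6h30 end to end.
wall_clock_seconds = 6.5 * 3600
images_per_second = 90 * TRAIN_IMAGES / wall_clock_seconds
print(round(images_per_second))
```

At a fixed number of epochs, a batch size of 1024 gives four times as many parameter updates as 4096. That may be part of why the smaller batch does better at these short schedules, though the review itself does not state a cause.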

This review was generated by AI and reviewed by a human editor.