QUICK REVIEW

[Paper Review] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner, Alexander Kolesnikov|arXiv (Cornell University)|Jun 18, 2021

Advanced Neural Network Applications | References: 37 | Cited by: 53

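One-Sentence Summary

This paper presents a large-scale, controlled study of how data, augmentation, and regularization affect the performance of Vision Transformers (ViT), and evaluates transfer learning against training from scratch under different compute budgets.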

ABSTRACT

Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.

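Research Motivation and Goals

Understand how training data size, augmentation, and regularization interact in ViTs.
Quantify the transferability of ViT models trained under different AugReg and data regimes.
Provide practical recommendations for choosing pre-training data, augmentation, and models under compute constraints.
Compare training from scratch against transferring pre-trained ViT models across diverse downstream tasks.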

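Proposed Method

Pre-train multiple ViT configurations (Ti, S, B, L) and hybrid models on ImageNet-1k and ImageNet-21k under controlled AugReg settings.
Use dropout and stochastic depth for regularization, and Mixup and RandAugment for data augmentation; explore two weight decay values.
Pre-train with Adam, using a cosine learning-rate schedule with warmup; standardize preprocessing and evaluation across datasets.
Fine-tune on downstream tasks with SGD across multiple datasets and resolutions; evaluate transfer performance on VTAB-3/VTAB (up to 19 tasks).
Compare transfer against training from scratch under a fixed compute budget; analyze how upstream dataset size affects transfer performance.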
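
Two ingredients of the pre-training recipe above, the cosine learning-rate schedule with warmup and Mixup, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, parameter names, the linear shape of the warmup, decay to zero, and drawing the mixing coefficient from Beta(alpha, alpha) are all assumptions made for illustration; the summary itself does not specify them.

```python
import math
import random


def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Learning rate at `step`: linear warmup (assumed), then cosine decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during the warmup phase.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def mixup(x_a, y_a, x_b, y_b, alpha, rng=random):
    """Blend two examples and their one-hot labels.

    The mixing coefficient lam is drawn from Beta(alpha, alpha) (assumed),
    and the same lam is applied to both the inputs and the labels.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


if __name__ == "__main__":
    # Hypothetical settings, chosen only to make the schedule's shape visible.
    for step in (0, 5, 10, 55, 100):
        print(step, warmup_cosine_lr(step, total_steps=100, base_lr=1.0, warmup_steps=10))
```

The inputs to mixup are flat lists of floats here to keep the sketch dependency-free; in practice the same blend would be applied to image tensors across a batch.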

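Experimental Results

Research Questions

RQ1: How do data augmentation and regularization interact with dataset size and model capacity in ViTs?
RQ2: Does pre-training on larger upstream data (ImageNet-21k) improve transfer performance across diverse downstream tasks?
RQ3: Is transferring pre-trained ViT models more cost-effective and more accurate on practical datasets?
RQ4: How do model size, patch size, and compute budget affect the value of AugReg in ViTs?
RQ5: What guidance can be offered for choosing a pre-trained model for transfer to a new task?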

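Main Findings

Carefully chosen augmentation and regularization can match the accuracy of models trained on an order of magnitude more data.
Transferring a pre-trained model is usually more cost-effective and yields better results on many practical datasets.
Pre-training on ImageNet-21k rather than ImageNet-1k improves transfer performance on VTAB tasks, especially at larger compute budgets.
When pre-training on ImageNet-21k, AugReg often hurts performance unless compute is increased accordingly, and the effect is more pronounced for smaller models.
More upstream data tends to yield more generic models that transfer better across diverse downstream tasks.
Selecting the best upstream model by upstream validation accuracy is usually an effective strategy for transfer; ImageNet-21k checkpoints are recommended.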

This review was generated by AI and reviewed by human editors.