QUICK REVIEW

[Paper Review] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner, Alexander Kolesnikov|arXiv (Cornell University)|Jun 18, 2021

Advanced Neural Network Applications | References: 37 | Citations: 53

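One-Line Summary

This paper presents a large-scale, controlled study of how data, augmentation, and regularization affect Vision Transformer (ViT) performance, evaluating both transfer learning and training from scratch under varying compute budgets.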

ABSTRACT

Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.

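Motivation and Objectives

Understand how training data size, augmentation, and regularization interact in ViTs.
Quantify the transferability of ViT models trained under different AugReg settings and data regimes.
Provide practical recommendations on pre-training data, augmentation, and model selection under compute constraints.
Compare training from scratch with transferring pre-trained ViT models across diverse downstream tasks.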

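Proposed Method

Pre-train multiple ViT configurations (Ti, S, B, L) and hybrids on ImageNet-1k and ImageNet-21k under controlled AugReg settings.
Apply dropout and stochastic depth as regularization. Use Mixup and RandAugment for data augmentation, and explore two weight decay values.
Pre-train with Adam, using a cosine learning rate schedule with warmup. Standardize preprocessing and evaluation across datasets.
Fine-tune downstream with SGD on multiple datasets and at multiple resolutions. Evaluate transfer performance on VTAB-3/VTAB (up to 19 tasks).
Compare transfer with training from scratch under a fixed compute budget. Analyze the effect of upstream data size on transfer performance.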
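
Two of the ingredients listed above, the cosine learning rate schedule with warmup and Mixup, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the linear warmup shape, the decay-to-zero endpoint, and the Beta(alpha, alpha) mixing distribution are assumptions based on how these techniques are commonly defined, and this review does not state the hyperparameters used in the paper.

```python
import math
import random


def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0.

    Assumed schedule shape; the paper's exact settings are not given here.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def mixup_pair(x1, y1, x2, y2, alpha, rng=random):
    """Mix two (flattened image, one-hot label) examples.

    The mixing weight lam is drawn from Beta(alpha, alpha); inputs and
    labels are blended with the same weight.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

In a real training loop the schedule would be queried once per optimizer step, and Mixup would be applied per batch on arrays rather than Python lists. Stochastic depth and RandAugment are omitted to keep the sketch short.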

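Experimental Results

Research Questions

RQ1: How do data augmentation and regularization interact with dataset size and model capacity in ViTs?
RQ2: Does pre-training on a larger upstream dataset (ImageNet-21k) improve transfer performance across diverse downstream tasks?
RQ3: For practical datasets, is transferring a pre-trained ViT more cost-effective than training from scratch, and does it give better results?
RQ4: How do model size, patch size, and compute budget affect the value of AugReg for ViTs?
RQ5: What guidance can be given for selecting a pre-trained model for transfer to a new task?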

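Key Findings

Carefully chosen augmentation and regularization can match the accuracy of models trained on 10x more data.
Transferring a pre-trained model is generally more cost-effective and yields better results on many practical datasets.
Pre-training on ImageNet-21k improves transfer performance across VTAB tasks compared with ImageNet-1k, most notably at larger compute budgets.
Unless compute is increased appropriately, AugReg often hurts performance when pre-training on ImageNet-21k, and the effect is most pronounced for small models.
More upstream data tends to yield more generic models that transfer better to a variety of downstream tasks.
Selecting the best upstream model by upstream validation accuracy is generally an effective strategy for transfer, and ImageNet-21k checkpoints are recommended.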
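
The selection rule in the last finding can be written as a small helper. This is only a sketch of the idea: the `Checkpoint` record, its field names, and the dataset labels `"i1k"` and `"i21k"` are hypothetical and do not correspond to any released API or checkpoint naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str                # hypothetical identifier
    upstream_dataset: str    # hypothetical label, e.g. "i1k" or "i21k"
    upstream_val_acc: float  # accuracy on a held-out upstream split


def pick_checkpoint(candidates, preferred_dataset="i21k"):
    """Pick the candidate with the highest upstream validation accuracy.

    Candidates pre-trained on preferred_dataset are used when any exist;
    otherwise all candidates are considered. Accuracies measured on
    different upstream datasets are not directly comparable, so the
    fallback is only a rough heuristic.
    """
    pool = [c for c in candidates if c.upstream_dataset == preferred_dataset]
    if not pool:
        pool = list(candidates)
    if not pool:
        raise ValueError("no candidate checkpoints given")
    return max(pool, key=lambda c: c.upstream_val_acc)
```

The helper never looks at downstream data, which is the point of the finding: upstream validation accuracy is used as a cheap proxy so that only one checkpoint needs to be fine-tuned.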

This review was generated by AI and checked by a human editor.