QUICK REVIEW

[Paper Review] Efficient Training of Visual Transformers with Small Datasets

Yahui Liu, Enver Sangineto|arXiv (Cornell University)|Jun 7, 2021

Advanced Neural Network Applications | Cited by 84

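One-Sentence Summary

This paper analyzes Visual Transformers (VTs) on small datasets and proposes a self-supervised dense relative localization loss that regularizes VT training, improving accuracy especially when data are limited. It reports consistent gains across multiple VT architectures and datasets, some of them dramatic.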

ABSTRACT

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose a self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged in the existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.

Motivation and Objectives

Compare the robustness of different second-generation Visual Transformers when trained from scratch or with limited data.
Introduce a self-supervised auxiliary task to regularize VT training without extra annotations.
Evaluate the proposed method across diverse datasets and training regimes to quantify gains.

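Proposed Method

Represent the image as the VT's final k×k grid of embeddings and add a lightweight MLP that predicts the relative distance between pairs of embeddings.
Sample pairs of embeddings and define a dense relative localization loss that regresses the MLP's prediction onto each pair's normalized 2D grid offset.
Combine L_drloc with the standard cross-entropy loss in a multi-objective function with a fixed weight lambda.
Use a 7×7 grid for the localization task to ensure stable convergence across architectures.
Apply the localization MLP to the final token embeddings without modifying the base VT architecture.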
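
The steps above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation (their code is at the GitHub URL given in the abstract). The function names, the toy 2-dimensional embeddings, the random untrained MLP, the L1 form of the regression error, the normalization of offsets by k, and the value lambda = 0.1 are all assumptions made here for illustration.

```python
import random


def sample_pairs(k, m, rng):
    """Sample m random pairs of (row, col) positions from a k x k grid."""
    def pos():
        return (rng.randrange(k), rng.randrange(k))
    return [(pos(), pos()) for _ in range(m)]


def target_offset(a, b, k):
    """Assumed target: absolute row/column offsets, normalized by k."""
    return (abs(a[0] - b[0]) / k, abs(a[1] - b[1]) / k)


def make_mlp(in_dim, hidden_dim, rng):
    """Hypothetical tiny 2-layer MLP with random weights: in_dim -> 2 outputs."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
          for _ in range(2)]

    def predict(e_a, e_b):
        x = list(e_a) + list(e_b)  # concatenate the two embeddings
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]  # ReLU
        out = [sum(w * v for w, v in zip(row, h)) for row in w2]
        return out[0], out[1]

    return predict


def drloc_loss(grid, predict, pairs):
    """Mean L1 error between predicted and target offsets (L1 is assumed)."""
    k = len(grid)
    total = 0.0
    for a, b in pairs:
        t_u, t_v = target_offset(a, b, k)
        d_u, d_v = predict(grid[a[0]][a[1]], grid[b[0]][b[1]])
        total += abs(t_u - d_u) + abs(t_v - d_v)
    return total / len(pairs)


def total_loss(ce_loss, drloc, lam):
    """Cross-entropy plus the localization loss with a fixed weight lam."""
    return ce_loss + lam * drloc


rng = random.Random(0)
k = 7  # 7x7 grid of final embeddings, as in the summary above
# Toy embeddings that simply encode their own normalized position.
grid = [[[i / k, j / k] for j in range(k)] for i in range(k)]
pairs = sample_pairs(k, 32, rng)

mlp = make_mlp(in_dim=4, hidden_dim=8, rng=rng)  # untrained predictor
oracle = lambda e_a, e_b: (abs(e_a[0] - e_b[0]), abs(e_a[1] - e_b[1]))

untrained = drloc_loss(grid, mlp, pairs)   # some positive error
perfect = drloc_loss(grid, oracle, pairs)  # essentially zero
# ce_loss=1.0 stands in for a real cross-entropy value; lam=0.1 is illustrative.
print(total_loss(ce_loss=1.0, drloc=untrained, lam=0.1))
```

In a real training loop the embeddings would come from the VT's last layer and the MLP would be trained jointly with the classifier, so the localization term only adds a small head on top of an unchanged backbone.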

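Experimental Results

Research Questions

RQ1: How do different second-generation Visual Transformers perform relative to one another, and to ResNet, on small and medium-sized datasets?
RQ2: Can a self-supervised auxiliary task improve VT training when data are scarce or there is a domain shift?
RQ3: Is the proposed dense relative localization loss broadly compatible with different VT architectures and training regimes, such as training from scratch and fine-tuning?

Key Findings

VTs show large performance variation on small datasets, even when their ImageNet results are similar.
CvT tends to be more robust than Swin and T2T in the small-data regime on several datasets.
Adding the dense relative localization loss (L_drloc) consistently improves VT accuracy across architectures and datasets, in some cases by a large margin (up to 45 points).
L_drloc provides a notable regularization effect, especially when training from scratch or with a limited number of epochs, and also brings modest gains for ResNets.
The method is easy to plug into existing VTs and does not depend on additional annotations.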

This review was generated by AI and checked by a human editor.