QUICK REVIEW

[Paper Review] Escaping the Big Data Paradigm with Compact Transformers

Ali Hassani, Steven Walton|arXiv (Cornell University)|Apr 12, 2021

Advanced Neural Network Applications|References: 48|Citations: 295

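One-line summary

This paper proposes compact Vision Transformers (ViT-Lite, CVT, and CCT) that can be trained from scratch on small datasets, and shows that they reach competitive or state-of-the-art accuracy with far fewer parameters and far less computation. It demonstrates data-efficient transformer models on CIFAR-10/100, Flowers-102, and ImageNet without requiring large-scale pre-training.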

ABSTRACT

With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that with the right size, convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as little as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when training from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data-efficiency over previous Transformer based models being over 10x smaller than other transformers and is 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.

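Motivation and objectives

Motivate and enable training transformer models from scratch on small datasets where data is scarce.
Develop compact Transformer variants that combine convolutional tokenization with attention for data efficiency and locality.
Propose SeqPool, which replaces the class token and improves pooling over the output token sequence.
Show that CCT, with its convolutional tokenizer, delivers high accuracy while keeping parameter count and computation low.
Show state-of-the-art or competitive results on CIFAR-10/100, Flowers-102, and ImageNet relative to model size and data regime.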

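Proposed method

Proposes ViT-Lite, CVT, and CCT as compact Vision Transformer variants suited to the small-data regime.
In CCT, replaces standard patch-based tokenization with convolutional tokenization to embed local structure.
Introduces SeqPool, an attention-based sequence pooling mechanism that maps the transformer output to a single class representation.
Evaluates models trained from scratch on CIFAR-10/100, MNIST, Fashion-MNIST, Flowers-102, and ImageNet-1k, using AdamW and cosine annealing.
Compares against CNN and ViT/DeiT baselines, including distillation scenarios, and reports parameter counts and MACs.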
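
The two mechanisms named above, convolutional tokenization and SeqPool, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The layer settings in the usage lines (a 3x3 convolution with stride 1 and padding 1 followed by 3x3 max pooling with stride 2 and padding 1, and 4x4 patches for the patch baseline) are illustrative assumptions, and the scoring weights `w` and `b` are hypothetical fixed values standing in for a learned linear layer. The function names are made up for this example.

```python
import math
from typing import List

Vector = List[float]


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after a convolution or pooling layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def seq_pool(tokens: List[Vector], w: Vector, b: float = 0.0) -> Vector:
    """Attention-style sequence pooling.

    Scores each token with a linear map (w, b), applies softmax across the
    sequence, and returns the weighted sum of tokens as one vector.
    w and b are hypothetical stand-ins for learned parameters.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, tok)) + b for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(wt * tok[j] for wt, tok in zip(weights, tokens)) for j in range(dim)]


# Token counts for a 32x32 input (CIFAR-sized), under the assumed settings.
patch_side = conv_output_size(32, kernel=4, stride=4, padding=0)
patch_tokens = patch_side ** 2  # 8 * 8 = 64 tokens

conv_side = conv_output_size(32, kernel=3, stride=1, padding=1)  # conv keeps 32
conv_side = conv_output_size(conv_side, kernel=3, stride=2, padding=1)  # pool -> 16
conv_tokens = conv_side ** 2  # 16 * 16 = 256 tokens

# With zero scoring weights every token gets equal weight, so SeqPool
# reduces to plain mean pooling over the sequence.
pooled = seq_pool([[1.0, 2.0], [3.0, 4.0]], w=[0.0, 0.0])
```

Under these assumed settings the overlapping convolutional tokenizer yields a longer sequence than non-overlapping patches, and each token covers a neighborhood that overlaps with its neighbors. The pooling step shows why a separate class token becomes unnecessary: every output token contributes to the final representation in proportion to its score.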

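Experimental results

Research questions

RQ1: Can vision transformers be trained effectively from scratch on small datasets without large-scale pre-training?
RQ2: Do compact transformer architectures with convolutional tokenization and SeqPool offer better data efficiency than ViT and CNNs on small datasets?
RQ3: How does using convolutional tokenization and SeqPool affect accuracy and efficiency across different image datasets?
RQ4: How does CCT perform against conventional CNNs and ViT variants on a medium-sized dataset such as ImageNet?
RQ5: Can transformer models be deployed with limited computing resources while maintaining competitive performance?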

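Key findings

Trained from scratch, CCT reaches 98% accuracy on CIFAR-10 with a model of about 3.7M parameters; Table 2 reports 98.00% on CIFAR-10 after 5000 epochs.
CCT outperforms ViT and many CNN-based methods on CIFAR-10/100 and Flowers-102, with far fewer parameters and MACs (for example, the CVT and CCT variants achieve strong results with 0.28–3.85M parameters).
On ImageNet-1k, CCT-14/7×2 reaches 80.67% top-1 with 22.36M parameters without distillation, and the distilled CCT reaches 81.34% top-1.
On Flowers-102, CCT-14/7×2 reaches 99.76% top-1 with ImageNet-scale pre-training, using far fewer parameters (about 22.17M) and 18.63G MACs.
CCT demonstrates strong data efficiency by matching or exceeding ResNet50 on CIFAR-10/100 at roughly 15% of its model size.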


This review was generated by AI and checked by a human editor.