QUICK REVIEW

[Paper Review] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Yufei Xu, Qiming Zhang|arXiv (Cornell University)|Jun 7, 2021

Advanced Neural Network Applications|References: 83|Citations: 155

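One-Line Summary

ViTAE introduces intrinsic inductive biases from convolutions into vision transformers and, through parallel locality modeling and multi-scale reduction cells, achieves data- and training-efficient ImageNet performance.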

ABSTRACT

Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Alternatively, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and is able to learn robust feature representation for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. Experiments on ImageNet as well as downstream tasks prove the superiority of ViTAE over the baseline transformer and concurrent works. Source code and pretrained models will be available at GitHub.
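
The multi-scale embedding mentioned in the abstract relies on a general property of dilated convolutions: a kernel of size k with dilation d spans d * (k - 1) + 1 input positions along each axis. The following is a minimal sketch of that arithmetic only. The 3 x 3 kernel and the dilation rates 1 to 4 are illustrative assumptions, not values taken from this review, and `effective_kernel_size` is a hypothetical helper name.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input pixels along one axis, covered by a dilated kernel."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return dilation * (kernel_size - 1) + 1


# Assumed values, for illustration only (not stated in the review).
KERNEL_SIZE = 3
DILATION_RATES = (1, 2, 3, 4)

# Map each dilation rate to the span its kernel covers.
spans = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATION_RATES}

for d, span in spans.items():
    print(f"dilation={d}: kernel covers {span} x {span} input pixels")
```

Under these assumptions, running several such convolutions side by side lets one embedding step see the same location at several spatial extents, which is the intuition behind the scale-invariance bias described above.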

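Motivation and Objectives

Motivate incorporating intrinsic inductive biases into Vision Transformers to improve the learning of locality and scale-aware features.
Design ViTAE with Reduction Cells (RCs), which embed multi-scale context and model locality in parallel with self-attention, and Normal Cells (NCs), which fuse MHSA with a Parallel Convolutional Module (PCM).
Demonstrate improvements in data and training efficiency, classification accuracy, and generalization on downstream tasks.
Provide ablations showing the contributions of the convolution-based modules and the fusion strategy.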

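Proposed Method

Two types of cells are introduced: Reduction Cells (RCs), which embed multi-scale context through a pyramid reduction module using multiple dilation rates and downsampling, and Normal Cells (NCs), which fuse MHSA with a Parallel Convolutional Module (PCM).
The three RCs downsample the input by 4x, 2x, and 2x respectively, producing tokens of size H/16 x W/16; the RC output is flattened and concatenated with a class token before entering the NCs.
The pyramid reduction module in each RC uses dilated convolutions with various rates to create multi-scale features; the MHSA branch processes this multi-scale context, while the PCM branch injects local features, and the two are merged before being passed to the FFN.
NCs preserve the token length, apply MHSA in parallel with the PCM, fuse the two by summation, and pass the result through an FFN with layer normalization and skip connections.
The model stacks three RCs followed by several NCs, in ViTAE-T and ViTAE-S configurations, and applies standard data augmentation on ImageNet for a fair comparison.
Training and evaluation use AdamW, a cosine scheduler, 300 epochs, and eight V100 GPUs; the models are compared with CNNs and transformers of similar size.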
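
The downsampling schedule described above can be checked with simple arithmetic. This sketch assumes a 224 x 224 input (a common ImageNet resolution, not stated in this review) and uses a hypothetical helper, `token_grid_sizes`. It only tracks token-grid sizes through the three RCs and does not implement the cells themselves.

```python
def token_grid_sizes(height, width, strides=(4, 2, 2)):
    """Return the (H, W) token grid after each reduction stage."""
    sizes = []
    h, w = height, width
    for stride in strides:
        if h % stride or w % stride:
            raise ValueError(f"{h} x {w} is not divisible by stride {stride}")
        h, w = h // stride, w // stride
        sizes.append((h, w))
    return sizes


# Assumed input resolution, for illustration only.
HEIGHT, WIDTH = 224, 224

grids = token_grid_sizes(HEIGHT, WIDTH)
final_h, final_w = grids[-1]

# The final grid is flattened and one class token is concatenated
# before the sequence enters the Normal Cells.
num_tokens = final_h * final_w + 1

for stage, (h, w) in enumerate(grids, start=1):
    print(f"after RC{stage}: {h} x {w} grid, {h * w} tokens")
print(f"sequence length entering the NCs: {num_tokens}")
```

The cumulative stride is 4 * 2 * 2 = 16, which matches the H/16 x W/16 token grid stated above. Because the NCs preserve the token length, this sequence length stays fixed through the rest of the network.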

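Experimental Results

Research Questions

RQ1: Can the intrinsic inductive biases of CNNs (locality and scale invariance) be effectively integrated into Vision Transformers to improve data efficiency and multi-scale feature learning?
RQ2: Does a fusion approach that performs local and global modeling in parallel within each layer outperform a serial locality-then-attention structure in Vision Transformers?
RQ3: How do RCs and NCs contribute, individually and jointly, to accuracy, training efficiency, and downstream generalization?
RQ4: How data- and training-efficient is ViTAE relative to baselines such as T2T-ViT and DeiT on ImageNet and on smaller datasets?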

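Key Findings

ViTAE-T achieves 75.3% Top-1 accuracy on ImageNet with 4.8M parameters, and ViTAE-S achieves 82.0% Top-1 with 23.6M parameters.
ViTAE is more data- and training-efficient, outperforming the T2T-ViT baseline when trained with less data and fewer epochs.
Ablation studies show that the PCM (locality) and the RC (multi-scale) substantially improve performance, and that early fusion combined with BN gives the best results.
ViTAE generalizes well to downstream tasks (CIFAR-10/100, iNaturalist, Cars, Flowers, Pets) with a parameter count comparable to or lower than that of most baselines.
Visual analysis shows that ViTAE focuses attention on targets more precisely and handles scale variation better than a pure transformer.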


This review was generated by AI and checked by a human editor.