QUICK REVIEW

[Paper Review] Transformer in Transformer

Kai Han, An Xiao|arXiv (Cornell University)|Feb 27, 2021

Advanced Neural Network Applications | References: 51 | Citations: 1,010

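One-line summary

TNT introduces an inner transformer over the visual words inside each image patch to enrich local features. Compared with the ViT/DeiT baselines, it improves ImageNet accuracy for a modest increase in FLOPs.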

ABSTRACT

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

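Motivation and Objectives

Motivate the need for vision transformers to preserve fine-grained local structure inside image patches.
Propose the Transformer-iN-Transformer (TNT) architecture, which consists of an inner word-level transformer and an outer sentence-level transformer.
Analyze the computational cost and parameter overhead of TNT relative to a standard transformer.
Demonstrate the effectiveness of TNT on ImageNet and downstream tasks through extensive experiments.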

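Proposed Method

Represent each image patch as a visual sentence and further divide it into visual words.
Apply an inner transformer to model the relationships among the visual words within each sentence.
Use an outer transformer to model the relationships among sentence embeddings across the whole image.
Before the outer transformer, add the word embeddings to the corresponding sentence embedding through a linear projection.
Use ViT-style training with DeiT-style augmentation, with learnable position encodings for both sentences and words.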
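
The first step above, splitting patches into visual sentences and visual words, can be sketched in plain Python. This is only an illustration of the bookkeeping, not the authors' implementation: the function name `split_into_words`, the single-channel toy image, and the nested-list layout are all assumptions made for this example. The sizes (224 × 224 input, 16 × 16 sentences, 4 × 4 words) are taken from the abstract and the results table.

```python
def split_into_words(image, patch=16, word=4):
    """Split an H x W single-channel image into visual sentences and words.

    Hypothetical helper for illustration. Returns a list of sentences;
    each sentence is a list of words; each word is a flat list of the
    pixel values inside one word x word sub-patch.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile into patches"
    assert patch % word == 0, "patch must tile into words"

    sentences = []
    for top in range(0, height, patch):           # walk over sentence rows
        for left in range(0, width, patch):       # walk over sentence columns
            words = []
            for wy in range(top, top + patch, word):        # word rows inside the sentence
                for wx in range(left, left + patch, word):  # word columns inside the sentence
                    words.append([image[y][x]
                                  for y in range(wy, wy + word)
                                  for x in range(wx, wx + word)])
            sentences.append(words)
    return sentences


# Toy image whose pixel value encodes its own position (assumption, for checking only).
image = [[y * 224 + x for x in range(224)] for y in range(224)]
sentences = split_into_words(image)

n = len(sentences)                 # number of visual sentences per image
m = len(sentences[0])              # number of visual words per sentence
pixels_per_word = len(sentences[0][0])
print(n, m, pixels_per_word)
```

In a real model each word and each sentence would then be mapped to an embedding vector; the inner transformer attends over the m words of one sentence, and the outer transformer attends over the n sentences of the image.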

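Experimental Results

Research Questions

RQ1 Does modeling intra-patch (word-level) relationships improve vision transformer performance over patch-level-only approaches?
RQ2 How do the inner transformer size, the number of words per patch, and position encodings affect accuracy and efficiency?
RQ3 Can TNT achieve a better accuracy/FLOPs trade-off than the ViT/DeiT baselines on ImageNet and downstream tasks?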

Key Findings

Model	Resolution	Params (M)	FLOPs (B)	Top-1 (%)	Top-5 (%)
ResNet-50	224 × 224	25.6	4.1	76.2	92.9
ResNet-152	224 × 224	60.2	11.5	78.3	94.1
RegNetY-8GF	224 × 224	39.2	8.0	79.9	-
RegNetY-16GF	224 × 224	83.6	15.9	80.4	-
EfficientNet-B3	300 × 300	12.0	1.8	81.6	94.9
EfficientNet-B4	380 × 380	19.0	4.2	82.9	96.4
DeiT-Ti	224 × 224	5.7	1.3	72.2	-
TNT-Ti	224 × 224	6.1	1.4	73.9	91.9
DeiT-S	224 × 224	22.1	4.6	79.8	-
PVT-Small	224 × 224	24.5	3.8	79.8	-
PVT-Medium	224 × 224	40.0	6.7	81.2	-
TNT-S	224 × 224	23.8	5.2	81.5	95.7
ViT-B/16	384 × 384	86.4	55.5	77.9	-
DeiT-B	224 × 224	86.4	17.6	81.8	-
T2T-ViT_t-24	224 × 224	63.9	13.2	82.2	-
TNT-B	224 × 224	65.6	14.1	82.9	96.3
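
To make the accuracy/cost trade-off in the table easier to read, the short script below compares each TNT variant with its DeiT counterpart. The numbers are copied from the table; the pairing of models by size and the helper name `compare` are choices made for this illustration.

```python
# (params in M, FLOPs in B, top-1 in %) copied from the table above
results = {
    "DeiT-Ti": (5.7, 1.3, 72.2),
    "TNT-Ti": (6.1, 1.4, 73.9),
    "DeiT-S": (22.1, 4.6, 79.8),
    "TNT-S": (23.8, 5.2, 81.5),
    "DeiT-B": (86.4, 17.6, 81.8),
    "TNT-B": (65.6, 14.1, 82.9),
}


def compare(baseline, candidate):
    """Return the top-1 gain (points) and cost ratios of candidate vs. baseline."""
    base_params, base_flops, base_top1 = results[baseline]
    cand_params, cand_flops, cand_top1 = results[candidate]
    return {
        "top1_gain": round(cand_top1 - base_top1, 1),
        "flops_ratio": round(cand_flops / base_flops, 2),
        "params_ratio": round(cand_params / base_params, 2),
    }


for baseline, candidate in [("DeiT-Ti", "TNT-Ti"), ("DeiT-S", "TNT-S"), ("DeiT-B", "TNT-B")]:
    print(candidate, "vs", baseline, compare(baseline, candidate))
```

Note that TNT-B is listed with fewer parameters and FLOPs than DeiT-B, so its ratios fall below 1, while the Ti and S variants pay a small cost increase for their gain.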

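TNT-S reaches 81.5% top-1 on ImageNet, about 1.7 percentage points higher than DeiT-S at a similar computational cost.
A TNT block costs about 1.14× the FLOPs and about 1.08× the parameters of a standard transformer block, in exchange for higher accuracy.
TNT outperforms several transformer-based and CNN-based baselines on ImageNet and transfers well to downstream datasets (CIFAR, Flowers, Pets, iNat).
Sentence and word position encodings both improve accuracy substantially; with both enabled, TNT-S reaches 81.5% top-1.
An inner transformer with 2 to 4 heads and the default number of words m=16 give the best performance (e.g., 81.5% with 4 inner heads).
An SE module gives TNT-S a small gain of about 0.2 points.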
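
The per-block overhead quoted above can be sanity-checked with a rough cost model. Everything in this sketch is an assumption rather than something stated in this review: the cost formulas are a common back-of-the-envelope estimate for a transformer block (self-attention plus an MLP with 4× expansion, ignoring biases and normalization), and the embedding widths d = 384 (sentence) and c = 24 (word) are assumed values for a small model. Only n = 196 and m = 16 follow from the 224 × 224 input, 16 × 16 sentences, and 4 × 4 words described earlier.

```python
def standard_block_cost(n, d):
    """Rough FLOPs and parameter count of one transformer block (assumed model).

    n: number of tokens, d: embedding width.
    """
    flops = 2 * n * d * (6 * d + n)   # attention + MLP with 4x expansion
    params = 12 * d * d               # 4*d*d for attention, 8*d*d for the MLP
    return flops, params


def tnt_block_cost(n, d, m, c):
    """Rough cost of one TNT block: inner block + projection + outer block."""
    outer_flops, outer_params = standard_block_cost(n, d)
    # Inner transformer: m words of width c, repeated for each of n sentences.
    inner_flops = 2 * n * m * c * (6 * c + m)
    inner_params = 12 * c * c
    # Linear projection of the m*c word features into the d-wide sentence embedding.
    proj_flops = n * m * c * d
    proj_params = m * c * d
    return (outer_flops + inner_flops + proj_flops,
            outer_params + inner_params + proj_params)


n, m = 196, 16      # from the 224x224 / 16x16 / 4x4 setup
d, c = 384, 24      # assumed widths, not given in this review

base_flops, base_params = standard_block_cost(n, d)
tnt_flops, tnt_params = tnt_block_cost(n, d, m, c)
flops_ratio = tnt_flops / base_flops
params_ratio = tnt_params / base_params
print(f"FLOPs ratio: {flops_ratio:.3f}, params ratio: {params_ratio:.3f}")
```

Under these assumptions the ratios come out close to the figures quoted above, which suggests that most of the extra cost comes from the word-to-sentence projection and the small inner transformer rather than from the outer transformer.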


This review was generated by AI and checked by a human editor.