QUICK REVIEW

[Paper Review] Transformer in Transformer

Kai Han, An Xiao|arXiv (Cornell University)|Feb 27, 2021

Advanced Neural Network Applications|References: 51|Cited by: 1,010

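One-Sentence Summary

TNT adds an inner, visual-word-level transformer inside each image patch to enrich local features, achieving higher ImageNet accuracy than ViT/DeiT baselines at a modest increase in FLOPs.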

ABSTRACT

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

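Motivation and Goals

Motivate the need to preserve fine-grained local structure within image patches in vision transformers.
Propose the Transformer-iN-Transformer (TNT) architecture, composed of an inner word-level transformer and an outer sentence-level transformer.
Analyze the computational cost and parameter overhead of TNT relative to a standard transformer.
Demonstrate the effectiveness of TNT on ImageNet and downstream tasks through extensive experiments.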

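Proposed Method

Treat each image patch as a visual sentence and further divide it into visual words.
Apply an inner transformer to model the relationships among the visual words within each sentence.
Apply an outer transformer to model the relationships among sentence embeddings across the image.
Before the outer transformer, add the word embeddings to the corresponding sentence embedding through a linear projection.
Train in a ViT-like fashion with DeiT-style data augmentation and learnable position encodings for both sentences and words.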
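
The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_into_words`, `tnt_block_sketch`), the random weights, and the sizes (n=196 sentences, m=16 words, word dimension c=24, sentence dimension d=384) are all hypothetical choices, and the sketch uses single-head attention while omitting LayerNorm, the MLP sub-layers, the class token, and position encodings. Sentence embeddings are initialized to zeros here purely for simplicity.

```python
import numpy as np

def split_into_words(image, patch=16, word=4):
    """Split an (H, W, C) image into visual sentences and visual words.

    Returns an array of shape (n, m, word*word*C): n patches (sentences),
    each holding m sub-patches (words) flattened to vectors.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0 and patch % word == 0
    gh, gw = H // patch, W // patch   # patch grid size
    k = patch // word                 # words per patch side
    # (H, W, C) -> (gh, gw, patch, patch, C): one entry per patch
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    # each patch -> k x k grid of (word x word) sub-patches
    x = x.reshape(gh, gw, k, word, k, word, C).transpose(0, 1, 2, 4, 3, 5, 6)
    return x.reshape(gh * gw, k * k, word * word * C)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over the token axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def tnt_block_sketch(words, sentences, inner_w, proj, outer_w):
    """One simplified TNT step: inner attention, projection, outer attention."""
    n, m, c = words.shape
    # inner transformer: words attend only to words in the same sentence
    words = words + self_attention(words, *inner_w)
    # linear projection folds each sentence's words into its sentence embedding
    sentences = sentences + words.reshape(n, m * c) @ proj
    # outer transformer: sentences attend to each other across the image
    sentences = sentences + self_attention(sentences, *outer_w)
    return words, sentences

rng = np.random.default_rng(0)
n, m, c, d = 196, 16, 24, 384          # assumed sizes, for illustration only
image = rng.standard_normal((224, 224, 3))
raw_words = split_into_words(image)    # pixels of each 4x4 word, flattened
embed = rng.standard_normal((raw_words.shape[-1], c)) * 0.02
words = raw_words @ embed              # word embeddings, shape (n, m, c)
sentences = np.zeros((n, d))           # sentence embeddings, shape (n, d)
inner_w = [rng.standard_normal((c, c)) * 0.02 for _ in range(3)]
outer_w = [rng.standard_normal((d, d)) * 0.02 for _ in range(3)]
proj = rng.standard_normal((m * c, d)) * 0.02
new_words, new_sentences = tnt_block_sketch(words, sentences, inner_w, proj, outer_w)
print(raw_words.shape, new_words.shape, new_sentences.shape)
```

With a 224 × 224 RGB input, 16 × 16 patches, and 4 × 4 words, the three printed shapes are (196, 16, 48), (196, 16, 24), and (196, 384). The inner attention runs over only 16 tokens per sentence, which is why its cost stays small next to the outer attention over 196 tokens.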

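Experimental Results

Research Questions

RQ1: Does modeling intra-patch (word-level) relationships improve vision transformer performance over patch-level-only approaches?
RQ2: How do the size of the inner transformer, the number of words per patch, and position encodings affect accuracy and efficiency?
RQ3: Does TNT achieve a better accuracy/FLOPs trade-off than ViT/DeiT baselines on ImageNet and downstream tasks?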

Key Findings

Model	Resolution	Params (M)	FLOPs (B)	Top-1 (%)	Top-5 (%)
ResNet-50	224 × 224	25.6	4.1	76.2	92.9
ResNet-152	224 × 224	60.2	11.5	78.3	94.1
RegNetY-8GF	224 × 224	39.2	8.0	79.9	-
RegNetY-16GF	224 × 224	83.6	15.9	80.4	-
EfficientNet-B3	300 × 300	12.0	1.8	81.6	94.9
EfficientNet-B4	380 × 380	19.0	4.2	82.9	96.4
DeiT-Ti	224 × 224	5.7	1.3	72.2	-
TNT-Ti	224 × 224	6.1	1.4	73.9	91.9
DeiT-S	224 × 224	22.1	4.6	79.8	-
PVT-Small	224 × 224	24.5	3.8	79.8	-
PVT-Medium	224 × 224	40.0	6.7	81.2	-
TNT-S	224 × 224	23.8	5.2	81.5	95.7
ViT-B/16	384 × 384	86.4	55.5	77.9	-
DeiT-B	224 × 224	86.4	17.6	81.8	-
T2T-ViT_t-24	224 × 224	63.9	13.2	82.2	-
TNT-B	224 × 224	65.6	14.1	82.9	96.3
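
The DeiT-to-TNT gaps can be read directly off the table. The short sketch below does that arithmetic for the three model sizes, using only rows copied from the table; the helper name `compare` and the dictionary layout are illustrative choices, not part of the paper.

```python
# (params in M, FLOPs in B, top-1 in %) copied from the table above
rows = {
    "DeiT-Ti": (5.7, 1.3, 72.2),
    "TNT-Ti": (6.1, 1.4, 73.9),
    "DeiT-S": (22.1, 4.6, 79.8),
    "TNT-S": (23.8, 5.2, 81.5),
    "DeiT-B": (86.4, 17.6, 81.8),
    "TNT-B": (65.6, 14.1, 82.9),
}

def compare(base, model):
    """Top-1 gain (percentage points) and cost ratios of model over base."""
    base_params, base_flops, base_top1 = rows[base]
    params, flops, top1 = rows[model]
    return {
        "top1_gain": round(top1 - base_top1, 1),
        "flops_ratio": round(flops / base_flops, 2),
        "params_ratio": round(params / base_params, 2),
    }

for base, model in [("DeiT-Ti", "TNT-Ti"), ("DeiT-S", "TNT-S"), ("DeiT-B", "TNT-B")]:
    print(base, "->", model, compare(base, model))
```

At the Ti and S sizes the gain is 1.7 points for a FLOPs ratio of roughly 1.1. At the B size the listed TNT-B is smaller than DeiT-B in both parameters and FLOPs, so that pair is not a like-for-like cost comparison.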

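TNT-S reaches 81.5% top-1 on ImageNet, about 1.7 percentage points higher than DeiT-S at similar computational cost.
Compared with a standard transformer block, a TNT block costs about 1.14× the FLOPs and about 1.08× the parameters, in exchange for higher accuracy.
TNT outperforms several transformer-based and CNN baselines on ImageNet and transfers well to downstream datasets (CIFAR, Flowers, Pets, iNat).
Both sentence and word position encodings improve accuracy noticeably; using both yields 81.5% top-1 on TNT-S.
Inner-transformer head counts of 2 to 4 and the default number of words m=16 give the best performance (for example, 81.5% with 4 inner heads).
An SE module gives TNT-S a small accuracy gain of about 0.2 percentage points.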
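
The block-level overhead quoted above (about 1.14× FLOPs and 1.08× parameters) can be sanity-checked with a simplified cost model. The sketch below assumes a standard block costs 2nd(6d + n) FLOPs and 12d² parameters (MLP expansion ratio 4, equal query/key/value widths, biases and normalization ignored), and that a TNT block adds an inner block over m words of width c plus one linear projection. Both the cost model and the sizes used (n=196, m=16, d=384, c=24) are assumptions not stated in this summary.

```python
def standard_block_cost(n, d):
    """FLOPs and parameters of one transformer block (simplified model)."""
    flops = 2 * n * d * (6 * d + n)
    params = 12 * d * d
    return flops, params

def tnt_block_cost(n, m, d, c):
    """FLOPs and parameters of one TNT block under the same model."""
    inner_flops = 2 * n * m * c * (6 * c + m)   # inner block, once per sentence
    proj_flops = n * m * c * d                  # word-to-sentence projection
    outer_flops, outer_params = standard_block_cost(n, d)
    flops = inner_flops + proj_flops + outer_flops
    params = 12 * c * c + m * c * d + outer_params
    return flops, params

# assumed TNT-S-like sizes: 196 sentences, 16 words, widths 384 and 24
n, m, d, c = 196, 16, 384, 24
base_flops, base_params = standard_block_cost(n, d)
tnt_flops, tnt_params = tnt_block_cost(n, m, d, c)
flops_ratio = tnt_flops / base_flops
params_ratio = tnt_params / base_params
print(f"FLOPs ratio: {flops_ratio:.3f}, params ratio: {params_ratio:.3f}")
```

Under these assumptions the ratios come out at about 1.14 for FLOPs and just under 1.09 for parameters, close to the figures quoted above. Most of the extra cost comes from the projection and the inner block's linear layers, not from the attention over the 16 words.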

This review was generated by AI and reviewed by human editors.