QUICK REVIEW

[Paper Review] Transformer in Transformer

Kai Han, An Xiao|arXiv (Cornell University)|Feb 27, 2021
Advanced Neural Network Applications|51 references|1,010 citations
TL;DR

TNT introduces an inner transformer over visual words inside image patches to enrich local features, achieving higher ImageNet accuracy with modest FLOPs increases compared to ViT/DeiT baselines.

ABSTRACT

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

Motivation & Objective

  • Motivate the need to preserve fine-grained local structure within image patches for visual transformers.
  • Propose the Transformer iN Transformer (TNT) architecture, which combines an inner word-level transformer with an outer sentence-level transformer.
  • Analyze the computational cost and parameter overhead of TNT compared with standard transformers.
  • Demonstrate TNT's effectiveness on ImageNet and downstream tasks through extensive experiments.

Proposed method

  • Represent each image patch as a visual sentence and further divide it into visual words.
  • Apply an inner transformer to model relations among visual words within each sentence.
  • Use an outer transformer to model relations among sentence embeddings across the image.
  • Flatten the word embeddings of each sentence, map them to the sentence dimension with a linear projection, and add the result to the corresponding sentence embedding before the outer transformer.
  • Employ standard ViT-like training with DeiT-style augmentations and learnable position encodings for sentences and words.
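
The steps above can be sketched at the shape level in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head, and layer normalization, the MLP sub-layers, and the class token are all omitted. The function names (`split_words`, `self_attention`, `tnt_block_sketch`), the toy sizes, and the random weights are hypothetical. Sharing one word position encoding across all sentences is also an assumption of this sketch.

```python
import numpy as np

def split_words(image, patch=16, word=4):
    """Split an (H, W, C) image into visual sentences and visual words.

    Returns an array of shape (n, m, word*word*C): n sentences (patches),
    each holding m words (sub-patches) of flattened pixels.
    """
    H, W, C = image.shape
    gh, gw = H // patch, W // patch          # sentence grid
    k = patch // word                        # words per sentence side
    # axes: sentence row, word row, pixel row, sentence col, word col, pixel col, channel
    x = image.reshape(gh, k, word, gw, k, word, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(gh * gw, k * k, word * word * C)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention along the token axis."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tnt_block_sketch(words, sentences, p):
    """One simplified TNT step: inner attention, fold words in, outer attention."""
    # Inner transformer: each word attends only to words of its own sentence.
    words = words + self_attention(words, p["iq"], p["ik"], p["iv"])
    # Flatten each sentence's words, project to the sentence width, and add.
    n, m, c = words.shape
    sentences = sentences + words.reshape(n, m * c) @ p["proj"]
    # Outer transformer: sentences attend to each other across the image.
    sentences = sentences + self_attention(sentences, p["oq"], p["ok"], p["ov"])
    return words, sentences

rng = np.random.default_rng(0)

def init(*shape):
    # Hypothetical random weights; a real model would learn these.
    return rng.normal(scale=0.1, size=shape)

image = rng.random((32, 32, 3))      # toy image: a 2x2 grid of 16x16 sentences
pixels = split_words(image)          # (4 sentences, 16 words, 48 pixel values)
n, m, _ = pixels.shape
c, d = 8, 32                         # toy word and sentence widths (assumed)

params = {
    "iq": init(c, c), "ik": init(c, c), "iv": init(c, c),
    "oq": init(d, d), "ok": init(d, d), "ov": init(d, d),
    "proj": init(m * c, d),
}
word_pos = init(m, c)                # one encoding per word slot, shared by all sentences
sent_pos = init(n, d)                # one encoding per sentence

words = pixels @ init(pixels.shape[-1], c) + word_pos
sentences = np.zeros((n, d)) + sent_pos
words_out, sentences_out = tnt_block_sketch(words, sentences, params)
print(words_out.shape, sentences_out.shape)   # (4, 16, 8) (4, 32)
```

Because the inner attention runs independently for each sentence, changing the words of one patch leaves the word features of every other patch untouched. Information crosses patches only through the outer attention over sentence embeddings.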

Experimental results

Research questions

  • RQ1: Does modeling intra-patch (word-level) relationships improve visual Transformer performance over patch-level approaches alone?
  • RQ2: What is the impact of inner transformer size, number of words per patch, and position encodings on accuracy and efficiency?
  • RQ3: Can TNT achieve better accuracy/FLOPs trade-offs than ViT/DeiT baselines on ImageNet and downstream tasks?

Key findings

  • TNT-S achieves 81.5% top-1 accuracy on ImageNet, about 1.7 percentage points higher than DeiT-S at similar computational cost.
  • A TNT block costs about 1.14x the FLOPs and about 1.08x the parameters of a standard transformer block, in exchange for improved accuracy.
  • TNT outperforms several transformer-based and CNN baselines on ImageNet and transfers well to downstream datasets (CIFAR, Flowers, Pets, iNat).
  • Position encodings for both sentences and words significantly boost accuracy; using both yields 81.5% top-1 on TNT-S.
  • Among the tested settings, 2-4 inner-transformer heads and the default word count m=16 give the best accuracy (e.g., 81.5% with 4 inner heads).
  • SE module can slightly improve TNT-S accuracy by around 0.2 percentage points.
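
A rough back-of-the-envelope count suggests why word-level attention adds little compute. It counts only the multiply-accumulates needed to form the attention score matrices, so it is not the full block cost and does not reproduce the 1.14x figure above. The function name `attention_score_macs` is hypothetical. The sizes are assumptions for illustration: a 224×224 input (giving n = 196 sentences of 16×16 pixels), m = 16 words per sentence, and inner/outer widths c = 24 and d = 384.

```python
def attention_score_macs(tokens, dim, groups=1):
    """Multiply-accumulates to form Q @ K^T for `groups` independent sequences."""
    return groups * tokens * tokens * dim

n, m = 196, 16      # assumed: 224x224 image, 16x16 sentences, 4x4-pixel words
c, d = 24, 384      # assumed inner (word) and outer (sentence) widths

inner = attention_score_macs(m, c, groups=n)      # words attend within a sentence
outer = attention_score_macs(n, d)                # sentences attend across the image
global_words = attention_score_macs(n * m, c)     # hypothetical: all words attend globally

print(inner)          # 1204224
print(outer)          # 14751744
print(global_words)   # 236027904
print(round(inner / outer, 3))    # 0.082
print(global_words // inner)      # 196
```

Under these assumptions the inner score computation is under a tenth of the outer one, and restricting word attention to a single sentence is n = 196 times cheaper than attending over all words in the image.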

This review was created by AI and reviewed by human editors.