QUICK REVIEW

[Paper Review] Transformer in Transformer

Kai Han, An Xiao | arXiv (Cornell University) | 2021-02-27

Advanced Neural Network Applications | References: 51 | Citations: 1,010

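One-Line Summary

TNT introduces an inner transformer over the visual words inside each image patch to enrich local features, and it raises ImageNet accuracy over ViT/DeiT baselines with a relatively small increase in computation.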

ABSTRACT

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

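Motivation and Objectives

Argue that vision transformers need to preserve the fine-grained local structure inside each image patch.
Propose the Transformer-iN-Transformer (TNT) architecture, which combines an inner word-level transformer with an outer sentence-level transformer.
Analyze the computational cost and parameter overhead of TNT relative to a standard transformer.
Demonstrate the effectiveness of TNT on ImageNet and downstream tasks through extensive experiments.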

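Proposed Method

Each image patch is treated as a visual sentence and further divided into visual words.
An inner transformer models the relationships among the visual words within each sentence.
An outer transformer models the relationships among sentence embeddings across the whole image.
Before the outer transformer, the word embeddings are added to the corresponding sentence embedding through a linear projection.
Learnable position encodings are used for both sentences and words, and training follows a standard ViT-like recipe with DeiT-style augmentation.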
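
The data flow described above can be sketched in a few lines of plain Python. This is a deliberately reduced illustration, not the authors' implementation: it uses single-head attention with identity query/key/value projections and leaves out LayerNorm, the MLP sub-layers, position encodings, and the class token. The function names (`attend`, `tnt_block`) and the toy sizes are assumptions made for this example only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Residual single-head self-attention with identity Q/K/V (a simplification)."""
    dim = len(tokens[0])
    scores = matmul(tokens, [list(col) for col in zip(*tokens)])
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, tokens)
    return [[t + u for t, u in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]


def tnt_block(words, sentences, proj):
    """words: n x m x c, sentences: n x d, proj: (m*c) x d linear projection."""
    # Inner transformer: words attend only to words of the same sentence.
    words = [attend(sentence_words) for sentence_words in words]
    # Fusion: flatten each sentence's words, project to d dims, add to the sentence.
    fused = []
    for sentence_words, sentence in zip(words, sentences):
        flat = [v for word in sentence_words for v in word]
        update = matmul([flat], proj)[0]
        fused.append([s + u for s, u in zip(sentence, update)])
    # Outer transformer: sentences attend to each other across the image.
    return words, attend(fused)


random.seed(0)
n, m, c, d = 4, 16, 3, 6  # toy sizes: 4 sentences, 16 words each
words = [[[random.gauss(0, 1) for _ in range(c)] for _ in range(m)] for _ in range(n)]
sentences = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
proj = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(m * c)]

new_words, new_sentences = tnt_block(words, sentences, proj)
print(len(new_words), len(new_words[0]), len(new_words[0][0]))  # n, m, c
print(len(new_sentences), len(new_sentences[0]))                # n, d
```

The point of the sketch is the ordering: word-level mixing stays local to a sentence, the projected words are added to the sentence embedding, and only then do sentences exchange information globally. Both token streams keep their shapes, so blocks of this kind can be stacked.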

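Experimental Results

Research Questions

RQ1: Does modeling intra-patch (word-level) relationships yield performance gains for vision transformers that are hard to obtain with patch-level modeling alone?
RQ2: How do the size of the inner transformer, the number of words per patch, and position encodings affect accuracy and efficiency?
RQ3: Can TNT achieve a better accuracy/compute (FLOPs) trade-off than ViT/DeiT baselines on ImageNet and downstream tasks?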

Key Results

Model	Resolution	Params (M)	FLOPs (B)	Top-1 (%)	Top-5 (%)
ResNet-50	224 × 224	25.6	4.1	76.2	92.9
ResNet-152	224 × 224	60.2	11.5	78.3	94.1
RegNetY-8GF	224 × 224	39.2	8.0	79.9	-
RegNetY-16GF	224 × 224	83.6	15.9	80.4	-
EfficientNet-B3	300 × 300	12.0	1.8	81.6	94.9
EfficientNet-B4	380 × 380	19.0	4.2	82.9	96.4
DeiT-Ti	224 × 224	5.7	1.3	72.2	-
TNT-Ti	224 × 224	6.1	1.4	73.9	91.9
DeiT-S	224 × 224	22.1	4.6	79.8	-
PVT-Small	224 × 224	24.5	3.8	79.8	-
PVT-Medium	224 × 224	40.0	6.7	81.2	-
TNT-S	224 × 224	23.8	5.2	81.5	95.7
ViT-B/16	384 × 384	86.4	55.5	77.9	-
DeiT-B	224 × 224	86.4	17.6	81.8	-
T2T-ViT_t-24	224 × 224	63.9	13.2	82.2	-
TNT-B	224 × 224	65.6	14.1	82.9	96.3
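
The table rows can be compared pairwise to see what each TNT variant gains over the DeiT model of the same scale. The snippet below is only arithmetic on the numbers in the table; the `compare` helper and the choice of pairs are assumptions made for illustration.

```python
# (params in M, FLOPs in B, top-1 in %) copied from the table rows above.
results = {
    "DeiT-Ti": (5.7, 1.3, 72.2),
    "TNT-Ti": (6.1, 1.4, 73.9),
    "DeiT-S": (22.1, 4.6, 79.8),
    "TNT-S": (23.8, 5.2, 81.5),
    "DeiT-B": (86.4, 17.6, 81.8),
    "TNT-B": (65.6, 14.1, 82.9),
}


def compare(baseline, candidate):
    """Return the top-1 gain (percentage points) and the cost ratios."""
    b_params, b_flops, b_top1 = results[baseline]
    c_params, c_flops, c_top1 = results[candidate]
    return {
        "top1_gain": round(c_top1 - b_top1, 1),
        "flops_ratio": round(c_flops / b_flops, 2),
        "params_ratio": round(c_params / b_params, 2),
    }


for baseline, candidate in [("DeiT-Ti", "TNT-Ti"), ("DeiT-S", "TNT-S"), ("DeiT-B", "TNT-B")]:
    print(candidate, "vs", baseline, compare(baseline, candidate))
```

Read this way, TNT-Ti and TNT-S each gain 1.7 points for a modest increase in FLOPs and parameters, while TNT-B gains 1.1 points over DeiT-B with fewer FLOPs and parameters, since the two base models are not configured identically.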

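TNT-S reaches 81.5% top-1 accuracy on ImageNet, about 1.7 percentage points higher than DeiT-S at similar computational cost.
Compared with a standard transformer block, a TNT block uses about 1.14x the FLOPs and about 1.08x the parameters while improving accuracy.
TNT outperforms several transformer-based and CNN baselines on ImageNet and also transfers well to downstream datasets (CIFAR, Flowers, Pets, iNat).
Position encodings for both sentences and words substantially improve accuracy; using both yields 81.5% top-1 for TNT-S.
An inner transformer with 2 to 4 heads and the default number of words m=16 gives the best performance (e.g., 81.5% with 4 inner heads).
An SE module can give TNT-S a small further accuracy gain of about 0.2 percentage points.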
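
The per-block overhead can be sanity-checked with a back-of-the-envelope count. The sketch below counts multiply-accumulates for a generic transformer block (attention plus an MLP with expansion ratio 4, ignoring biases, LayerNorm, and softmax) and adds the inner transformer and the word-to-sentence projection on top. The embedding widths `c = 24` and `d = 384` are assumed values chosen for illustration and are not stated in this review; the token counts follow from a 224x224 input with 16x16 sentences and 4x4 words.

```python
def block_macs(n, d, r=4):
    """Multiply-accumulates of one transformer block on n tokens of width d.

    4*n*d^2 for the Q, K, V and output projections, 2*n^2*d for the attention
    scores and the weighted sum, 2*r*n*d^2 for the two MLP layers.
    """
    return 4 * n * d * d + 2 * n * n * d + 2 * r * n * d * d


def block_params(d, r=4):
    """Weight count of the same block: 4*d^2 for attention, 2*r*d^2 for the MLP."""
    return 4 * d * d + 2 * r * d * d


image, patch, word = 224, 16, 4
n = (image // patch) ** 2  # visual sentences per image: 196
m = (patch // word) ** 2   # visual words per sentence: 16
c, d = 24, 384             # ASSUMED word and sentence embedding widths

outer_macs = block_macs(n, d)     # outer transformer over n sentences
inner_macs = n * block_macs(m, c) # inner transformer, run once per sentence
fuse_macs = n * (m * c) * d       # linear projection of m*c word features to d

flops_ratio = (outer_macs + inner_macs + fuse_macs) / outer_macs
params_ratio = (block_params(d) + block_params(c) + m * c * d) / block_params(d)

print(f"sentences={n}, words per sentence={m}")
print(f"FLOPs ratio ~ {flops_ratio:.2f}, params ratio ~ {params_ratio:.2f}")
```

Under these assumptions the estimate comes out at roughly 1.14x the FLOPs and 1.09x the parameters of the plain block, which is in line with the overhead figures listed above. Most of the extra cost comes from the projection and the inner MLP rather than from the inner attention itself, because each sentence holds only 16 words.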

This review was generated by AI and reviewed by a human editor.