QUICK REVIEW

[Paper Review] A Survey on Visual Transformer

Kai Han, Yunhe Wang|arXiv (Cornell University)|Dec 23, 2020

Advanced Neural Network Applications | References: 70 | Cited by: 220

ONE-SENTENCE SUMMARY

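This survey reviews vision transformer models across backbone networks, high/mid-level vision, low-level vision, and video tasks, and analyzes their strengths, limitations, and efficient variants.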

ABSTRACT

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent neural networks. Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.

MOTIVATION AND OBJECTIVES

Survey the development of vision transformer models categorized by application (backbone, high/mid-level, low-level, video).
Analyze core components (self-attention, positional encoding, architecture variants) and efficiency methods for real-device deployment.
Discuss challenges, trade-offs, and potential research directions in vision transformers.
Provide comparisons of representative models and summarize key findings to guide future research.

PROPOSED APPROACH

Explain standard transformer components and self-attention equations (Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V).
Describe ViT and variants that adapt transformers for images (patch embeddings, positional encodings, class token).
Survey backbone, high/mid-level vision, low-level vision, and video processing models and efficiency approaches.
Summarize generative and self-supervised pretraining methods (iGPT, MAE, SimMIM) as well as contrastive learning (MoCo v3).
Compare CNN+Transformer hybrids and pure transformer backbones with quantitative results where available.
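
To make the self-attention equation quoted above concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is only an illustration of the formula Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V, not code from the survey: the helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy 3-token input are assumptions chosen for readability, and a real model would use a tensor library with multiple heads and learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns the attended values and the attention weights.
    """
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input (hypothetical values): 3 tokens with d_k = d_v = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(q, k, v)
print(out)
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors. In a ViT-style model the tokens would be patch embeddings plus a class token, with positional encodings added before attention.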

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: What are the key categories and tasks where vision transformers have been applied?
RQ2: What are the main architectural variants and techniques to improve locality, efficiency, and performance in vision transformers?
RQ3: How do vision transformers compare to CNNs in terms of accuracy, throughput, and data efficiency across tasks?
RQ4: What are the effective pretraining strategies (supervised, self-supervised, generative) for vision transformers?
RQ5: What are the open challenges and future directions for vision transformers?

KEY FINDINGS

Model	Params (M)	FLOPs (B)	Throughput (images/s)	ImageNet Top-1 (%)
ResNet-50	25.6	4.1	1226	79.1
ResNet-101	44.7	7.9	753	79.9
ResNet-152	60.2	11.5	526	80.8
EfficientNet-B0	5.3	0.39	2694	77.1
EfficientNet-B1	7.8	0.70	1662	79.1
EfficientNet-B2	9.2	1.0	1255	80.1
EfficientNet-B3	12	1.8	732	81.6
EfficientNet-B4	19	4.2	349	82.9
DeiT-Ti	5	1.3	2536	72.2
DeiT-S	22	4.6	940	79.8
DeiT-B	86	17.6	292	81.8
T2T-ViT-14	21.5	5.2	764	81.5
T2T-ViT-19	39.2	8.9	464	81.9
T2T-ViT-24	64.1	14.1	312	82.3
PVT-Small	24.5	3.8	820	79.8
PVT-Medium	44.2	6.7	526	81.2
PVT-Large	61.4	9.8	367	81.7
TNT-S	23.8	5.2	428	81.5
TNT-B	65.6	14.1	246	82.9
CPVT-S	23	4.6	930	80.5
CPVT-B	88	17.6	285	82.3
Swin-T	29	4.5	755	81.3
Swin-S	50	8.7	437	83.0
Swin-B	88	15.4	278	83.3
Twins-SVT-S	24	2.9	1059	81.7
Twins-SVT-B	56	8.6	469	83.2
Twins-SVT-L	99.2	15.1	288	83.7
Shuffle-T	29	4.6	791	82.5
Shuffle-S	50	8.9	450	83.5
Shuffle-B	88	15.6	279	84.0
CMT-S	25.1	4.0	563	83.5
CMT-B	45.7	9.3	285	84.5
VOLO-D1	27	6.8	481	84.2
VOLO-D2	59	14.1	244	85.2
VOLO-D3	86	20.6	168	85.4
VOLO-D4	193	43.8	100	85.7
VOLO-D5	296	69.0	64	86.1
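
One way to read the table above is to ask which models are not beaten on both throughput and top-1 accuracy by any other model. The sketch below computes that accuracy/throughput Pareto frontier from the table's own numbers. The function name `pareto_frontier` and the dominance rule are illustrative assumptions, not part of the survey, and throughput depends on the hardware and batch size used for measurement, which this summary does not state.

```python
# (model, throughput in images/s, top-1 accuracy in %), copied from the table.
ROWS = [
    ("ResNet-50", 1226, 79.1), ("ResNet-101", 753, 79.9), ("ResNet-152", 526, 80.8),
    ("EfficientNet-B0", 2694, 77.1), ("EfficientNet-B1", 1662, 79.1),
    ("EfficientNet-B2", 1255, 80.1), ("EfficientNet-B3", 732, 81.6),
    ("EfficientNet-B4", 349, 82.9),
    ("DeiT-Ti", 2536, 72.2), ("DeiT-S", 940, 79.8), ("DeiT-B", 292, 81.8),
    ("T2T-ViT-14", 764, 81.5), ("T2T-ViT-19", 464, 81.9), ("T2T-ViT-24", 312, 82.3),
    ("PVT-Small", 820, 79.8), ("PVT-Medium", 526, 81.2), ("PVT-Large", 367, 81.7),
    ("TNT-S", 428, 81.5), ("TNT-B", 246, 82.9),
    ("CPVT-S", 930, 80.5), ("CPVT-B", 285, 82.3),
    ("Swin-T", 755, 81.3), ("Swin-S", 437, 83.0), ("Swin-B", 278, 83.3),
    ("Twins-SVT-S", 1059, 81.7), ("Twins-SVT-B", 469, 83.2), ("Twins-SVT-L", 288, 83.7),
    ("Shuffle-T", 791, 82.5), ("Shuffle-S", 450, 83.5), ("Shuffle-B", 279, 84.0),
    ("CMT-S", 563, 83.5), ("CMT-B", 285, 84.5),
    ("VOLO-D1", 481, 84.2), ("VOLO-D2", 244, 85.2), ("VOLO-D3", 168, 85.4),
    ("VOLO-D4", 100, 85.7), ("VOLO-D5", 64, 86.1),
]

def dominates(a, b):
    """True if row a is at least as good as row b on both axes and strictly better on one."""
    _, thr_a, acc_a = a
    _, thr_b, acc_b = b
    return thr_a >= thr_b and acc_a >= acc_b and (thr_a > thr_b or acc_a > acc_b)

def pareto_frontier(rows):
    """Return the rows that no other row dominates, fastest first."""
    frontier = [r for r in rows if not any(dominates(other, r) for other in rows)]
    return sorted(frontier, key=lambda r: r[1], reverse=True)

frontier = pareto_frontier(ROWS)
for name, throughput, top1 in frontier:
    print(f"{name:16s} {throughput:5d} images/s  {top1:.1f}%")
```

This view ignores parameter count and FLOPs, so it complements rather than replaces the full table.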

Vision transformers provide competitive or superior performance on many visual benchmarks, approaching or exceeding CNNs in several settings.
Pure transformer backbones like ViT require large-scale pretraining to surpass CNNs, with data efficiency improved via DeiT and distillation.
Locality-enhancing variants (TNT, Swin, RegionViT, etc.) and convolution-integrated hybrids (CvT, CvViT, LeViT) improve data efficiency and real-time performance.
Self-supervised and generative pretraining (iGPT, MAE, SimMIM, MoCo v3) enable strong representations with less labeled data.
Efficient transformer designs (windowed attention, hierarchical pyramids, NAS-inspired architectures) balance accuracy, FLOPs, and throughput for practical deployment.
In the benchmark comparison, the listed transformer models (e.g., DeiT, Swin, TNT, and other ViT variants) reach ImageNet top-1 accuracies from about 72% (DeiT-Ti) to about 86% (VOLO-D5), with widely varying parameter counts and compute.
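
To illustrate why windowed attention helps efficiency, the sketch below compares rough cost estimates for global and window-based multi-head self-attention. The two formulas are the ones given in the Swin Transformer paper (softmax cost omitted), not something stated in this summary. The function names and the example setting (a 56 x 56 token map, 96 channels, 7 x 7 windows) are assumptions chosen for illustration.

```python
def global_msa_cost(h, w, c):
    """Estimated cost of global self-attention on an h x w token map with c channels.

    4*h*w*c^2 covers the Q, K, V and output projections;
    2*(h*w)^2*c covers the attention matrix and the weighted sum of values.
    """
    n = h * w
    return 4 * n * c * c + 2 * n * n * c

def window_msa_cost(h, w, c, m):
    """Estimated cost when attention is restricted to non-overlapping m x m windows.

    Each token attends to m*m tokens instead of all h*w tokens,
    so the quadratic term becomes linear in the number of tokens.
    """
    n = h * w
    return 4 * n * c * c + 2 * m * m * n * c

# Hypothetical setting for illustration.
h, w, c, m = 56, 56, 96, 7
global_cost = global_msa_cost(h, w, c)
window_cost = window_msa_cost(h, w, c, m)
ratio = global_cost / window_cost
print(f"global: {global_cost:,}  windowed: {window_cost:,}  ratio: {ratio:.1f}x")
```

The global cost grows quadratically with the number of tokens, while the windowed cost grows linearly for a fixed window size, which is the main reason window-based backbones scale to high-resolution inputs.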

This review was generated by AI and reviewed by human editors.