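[Paper Digest] A Survey on Visual Transformer
This survey reviews vision transformer models across backbone networks, high/mid-level vision, low-level vision, and video tasks, and analyzes their strengths, limitations, and efficient variants.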
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent neural networks. Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.
Motivation and Objectives
- Survey the development of vision transformer models categorized by application (backbone, high/mid-level, low-level, video).
- Analyze core components (self-attention, positional encoding, architecture variants) and efficiency methods for real-device deployment.
- Discuss challenges, trade-offs, and potential research directions in vision transformers.
- Provide comparisons of representative models and summarize key findings to guide future research.
Proposed Approach
- Explain standard transformer components and self-attention equations (Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V).
- Describe ViT and variants that adapt transformers for images (patch embeddings, positional encodings, class token).
- Survey backbone, high/mid-level vision, low-level vision, and video processing models and efficiency approaches.
- Summarize self-supervised pretraining methods, both generative (iGPT, MAE, SimMIM) and contrastive (MoCo v3).
- Compare CNN+Transformer hybrids and pure transformer backbones with quantitative results where available.
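The patch-embedding and self-attention steps listed above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration under stated assumptions, not the survey's or ViT's reference implementation: the function names (`patchify`, `attention`), the toy image size, the embedding width, and the random weights are all hypothetical, and multi-head splitting, layer normalization, and the MLP block are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights

# Toy sizes chosen for illustration only (assumptions, not from the paper).
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patches = patchify(image, patch_size=8)           # 16 patches of length 192
d_model = 64
w_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02

# Prepend a class token, then add a positional encoding per token.
cls_token = np.zeros((1, d_model))
pos_embed = rng.standard_normal((patches.shape[0] + 1, d_model)) * 0.02
tokens = np.concatenate([cls_token, patches @ w_embed], axis=0) + pos_embed

# Random projections stand in for learned Q, K, V weights.
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
out, weights = attention(tokens @ w_q, tokens @ w_k, tokens @ w_v)
print(tokens.shape, out.shape)
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors of all tokens, including the class token.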
Experimental Results
Research Questions
- RQ1: What are the key categories and tasks where vision transformers have been applied?
- RQ2: What are the main architectural variants and techniques to improve locality, efficiency, and performance in vision transformers?
- RQ3: How do vision transformers compare to CNNs in terms of accuracy, throughput, and data efficiency across tasks?
- RQ4: What are the effective pretraining strategies (supervised, self-supervised, generative) for vision transformers?
- RQ5: What are the open challenges and future directions for vision transformers?
Key Findings
| Model | Params (M) | FLOPs (B) | Throughput (images/s) | ImageNet Top-1 (%) |
|---|---|---|---|---|
| ResNet-50 | 25.6 | 4.1 | 1226 | 79.1 |
| ResNet-101 | 44.7 | 7.9 | 753 | 79.9 |
| ResNet-152 | 60.2 | 11.5 | 526 | 80.8 |
| EfficientNet-B0 | 5.3 | 0.39 | 2694 | 77.1 |
| EfficientNet-B1 | 7.8 | 0.70 | 1662 | 79.1 |
| EfficientNet-B2 | 9.2 | 1.0 | 1255 | 80.1 |
| EfficientNet-B3 | 12 | 1.8 | 732 | 81.6 |
| EfficientNet-B4 | 19 | 4.2 | 349 | 82.9 |
| DeiT-Ti | 5 | 1.3 | 2536 | 72.2 |
| DeiT-S | 22 | 4.6 | 940 | 79.8 |
| DeiT-B | 86 | 17.6 | 292 | 81.8 |
| T2T-ViT-14 | 21.5 | 5.2 | 764 | 81.5 |
| T2T-ViT-19 | 39.2 | 8.9 | 464 | 81.9 |
| T2T-ViT-24 | 64.1 | 14.1 | 312 | 82.3 |
| PVT-Small | 24.5 | 3.8 | 820 | 79.8 |
| PVT-Medium | 44.2 | 6.7 | 526 | 81.2 |
| PVT-Large | 61.4 | 9.8 | 367 | 81.7 |
| TNT-S | 23.8 | 5.2 | 428 | 81.5 |
| TNT-B | 65.6 | 14.1 | 246 | 82.9 |
| CPVT-S | 23 | 4.6 | 930 | 80.5 |
| CPVT-B | 88 | 17.6 | 285 | 82.3 |
| Swin-T | 29 | 4.5 | 755 | 81.3 |
| Swin-S | 50 | 8.7 | 437 | 83.0 |
| Swin-B | 88 | 15.4 | 278 | 83.3 |
| Twins-SVT-S | 24 | 2.9 | 1059 | 81.7 |
| Twins-SVT-B | 56 | 8.6 | 469 | 83.2 |
| Twins-SVT-L | 99.2 | 15.1 | 288 | 83.7 |
| Shuffle-T | 29 | 4.6 | 791 | 82.5 |
| Shuffle-S | 50 | 8.9 | 450 | 83.5 |
| Shuffle-B | 88 | 15.6 | 279 | 84.0 |
| CMT-S | 25.1 | 4.0 | 563 | 83.5 |
| CMT-B | 45.7 | 9.3 | 285 | 84.5 |
| VOLO-D1 | 27 | 6.8 | 481 | 84.2 |
| VOLO-D2 | 59 | 14.1 | 244 | 85.2 |
| VOLO-D3 | 86 | 20.6 | 168 | 85.4 |
| VOLO-D4 | 193 | 43.8 | 100 | 85.7 |
| VOLO-D5 | 296 | 69.0 | 64 | 86.1 |
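One way to read the accuracy and throughput columns above is to ask which models are not beaten on both axes at once. The sketch below computes that Pareto front for a hand-picked subset of rows in the 2.9 to 5.2 GFLOPs range. The numbers are copied from the table; the subset choice, the `Row` type, and the helper names are assumptions made for illustration, and throughput figures depend on the hardware and software used to measure them.

```python
from typing import NamedTuple

class Row(NamedTuple):
    name: str
    flops_b: float      # FLOPs in billions
    throughput: float   # images per second
    top1: float         # ImageNet top-1 accuracy in percent

# Subset of the table above (values copied verbatim).
ROWS = [
    Row("ResNet-50", 4.1, 1226, 79.1),
    Row("EfficientNet-B4", 4.2, 349, 82.9),
    Row("DeiT-S", 4.6, 940, 79.8),
    Row("T2T-ViT-14", 5.2, 764, 81.5),
    Row("PVT-Small", 3.8, 820, 79.8),
    Row("TNT-S", 5.2, 428, 81.5),
    Row("CPVT-S", 4.6, 930, 80.5),
    Row("Swin-T", 4.5, 755, 81.3),
    Row("Twins-SVT-S", 2.9, 1059, 81.7),
    Row("Shuffle-T", 4.6, 791, 82.5),
    Row("CMT-S", 4.0, 563, 83.5),
]

def dominates(a: Row, b: Row) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    at_least = a.throughput >= b.throughput and a.top1 >= b.top1
    strictly = a.throughput > b.throughput or a.top1 > b.top1
    return at_least and strictly

def pareto_front(rows):
    """Rows not dominated by any other row, fastest first."""
    front = [r for r in rows if not any(dominates(o, r) for o in rows)]
    return sorted(front, key=lambda r: -r.throughput)

for r in pareto_front(ROWS):
    print(f"{r.name:<12} {r.throughput:>6.0f} img/s  {r.top1:.1f}%")
```

Within this subset, a model such as DeiT-S drops off the front because Twins-SVT-S is listed with both higher throughput and higher top-1 accuracy; a different subset or a third axis (parameters, FLOPs) would give a different front.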
- Vision transformers provide competitive or superior performance on many visual benchmarks, approaching or exceeding CNNs in several settings.
- Pure transformer backbones like ViT require large-scale pretraining to surpass CNNs, with data efficiency improved via DeiT and distillation.
- Locality-enhancing variants (TNT, Swin, RegionViT, etc.) and convolution-integrated hybrids (CvT, CvViT, LeViT) improve data efficiency and real-time performance.
- Self-supervised and generative pretraining (iGPT, MAE, SimMIM, MoCo v3) enable strong representations with less labeled data.
- Efficient transformer designs (windowed attention, hierarchical pyramids, NAS-inspired architectures) balance accuracy, FLOPs, and throughput for practical deployment.
- In the ImageNet comparison above, transformer-based models span top-1 accuracies from 72.2% (DeiT-Ti) to 86.1% (VOLO-D5), with parameter counts ranging from 5M to 296M and compute from 1.3B to 69.0B FLOPs.
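To make the efficiency argument for windowed attention concrete, the sketch below compares the cost of global self-attention with attention restricted to non-overlapping M x M windows. The two formulas are the per-layer estimates given in the Swin Transformer paper, not in this digest, and they count only the linear projections and the two attention matrix products (softmax and biases are ignored). The function names and the example feature-map size are assumptions.

```python
def projection_flops(h: int, w: int, c: int) -> int:
    """Cost of the Q, K, V and output projections: 4 * hw * C^2."""
    return 4 * h * w * c * c

def global_msa_flops(h: int, w: int, c: int) -> int:
    """Global attention: 4*hw*C^2 + 2*(hw)^2*C, quadratic in token count."""
    n = h * w
    return projection_flops(h, w, c) + 2 * n * n * c

def window_msa_flops(h: int, w: int, c: int, m: int) -> int:
    """Windowed attention: 4*hw*C^2 + 2*M^2*hw*C, linear in token count."""
    n = h * w
    return projection_flops(h, w, c) + 2 * m * m * n * c

# Example: a 56 x 56 feature map with 96 channels and 7 x 7 windows
# (illustrative sizes, chosen as an assumption for this sketch).
h, w, c, m = 56, 56, 96, 7
g = global_msa_flops(h, w, c)
win = window_msa_flops(h, w, c, m)
print(f"global:   {g / 1e9:.2f} GFLOPs")
print(f"windowed: {win / 1e9:.2f} GFLOPs")
```

The projection term is identical in both cases; the saving comes entirely from the attention term, which shrinks by a factor of hw / M^2 (64 for the sizes used here).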
This digest was generated by AI and reviewed by human editors.