QUICK REVIEW

[Paper Review] A survey of the Vision Transformers and their CNN-Transformer based Variants

Asifullah Khan, Zunaira Rauf | arXiv (Cornell University) | May 17, 2023

Advanced Neural Network Applications | Cited by 13

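ONE-SENTENCE SUMMARY

This paper surveys Vision Transformers and their CNN-Transformer hybrids, proposes a taxonomy of the hybrid architectures, and discusses their attention mechanisms, positional embeddings, multi-scale processing, and convolutional components.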

ABSTRACT

Vision transformers have become popular as a possible substitute to convolutional neural networks (CNNs) for a variety of computer vision applications. These transformers, with their ability to focus on global relationships in images, offer large learning capacity. However, they may suffer from limited generalization as they do not tend to model local correlation in images. Recently, in vision transformers hybridization of both the convolution operation and self-attention mechanism has emerged, to exploit both the local and global image representations. These hybrid vision transformers, also referred to as CNN-Transformer architectures, have demonstrated remarkable results in vision applications. Given the rapidly growing number of hybrid vision transformers, it has become necessary to provide a taxonomy and explanation of these hybrid architectures. This survey presents a taxonomy of the recent vision transformer architectures and more specifically that of the hybrid vision transformers. Additionally, the key features of these architectures such as the attention mechanisms, positional embeddings, multi-scale processing, and convolution are also discussed. In contrast to the previous survey papers that are primarily focused on individual vision transformer architectures or CNNs, this survey uniquely emphasizes the emerging trend of hybrid vision transformers. By showcasing the potential of hybrid vision transformers to deliver exceptional performance across a range of computer vision tasks, this survey sheds light on the future directions of this rapidly evolving architecture.

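MOTIVATION AND OBJECTIVES

Motivate the study by highlighting the rise of Vision Transformers as an alternative to CNNs in computer vision.
Provide a taxonomy of recent vision transformer architectures, with an emphasis on hybrid CNN-Transformer variants.
Discuss key features such as attention mechanisms, positional embeddings, multi-scale processing, and convolutional components.
Distinguish this survey from earlier ones by focusing on hybrid architectures and their practical performance across vision tasks.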

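PROPOSED METHOD

Systematically synthesize the literature on recent Vision Transformers and CNN-Transformer hybrid models.
Build a taxonomy that classifies architectures by how they integrate CNN components with self-attention.
Critically discuss architectural features, including attention, positional embeddings, multi-scale processing, and convolution operations.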
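
To make the idea of integrating convolution with self-attention concrete, the following is a minimal sketch of one possible sequential hybrid block: a local convolution over the token axis, followed by global self-attention, with a residual connection. It is an illustration only and is not taken from the survey or from any specific model it covers. The function names (`conv1d_local`, `self_attention`, `hybrid_block`), the 1D token layout, the fixed smoothing kernel, and the identity query/key/value projections are all assumptions chosen to keep the example dependency-free.

```python
import math

def conv1d_local(tokens, kernel):
    """Depthwise 1D cross-correlation over the token axis with zero padding.

    Each output token mixes only its immediate neighbours (local context).
    """
    n, d = len(tokens), len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for c in range(d):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights), so every
    output token is a weighted average of all input tokens (global context).
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

def hybrid_block(tokens, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical sequential hybrid: convolution, then attention, plus residual."""
    local = conv1d_local(tokens, kernel)
    mixed = self_attention(local)
    return [[x + y for x, y in zip(t, m)] for t, m in zip(tokens, mixed)]

if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in hybrid_block(patches):
        print([round(v, 3) for v in row])
```

Placing the convolution before attention is only one way to combine the two operations. Running them in parallel branches or embedding convolution inside the attention layer are other possibilities, and the sketch does not represent those.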

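RESULTS

Research Questions

RQ1: What are the main architectural families of vision transformers and their CNN-Transformer hybrids?
RQ2: How do hybrid architectures combine convolution with self-attention to capture both local and global image structure?
RQ3: What are the common design choices (such as positional embeddings and multi-scale processing), and how do they affect performance?
RQ4: What are the future directions and open challenges for hybrid vision transformers?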

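Key Findings

Hybrid vision transformers exploit both local and global image representations by integrating CNN and Transformer components.
Attention mechanisms, positional embeddings, and multi-scale processing are central to hybrid architectures and affect their performance.
The surveyed literature points to a shift from pure transformers or pure CNNs toward hybrid designs across a range of vision tasks.
The paper provides a taxonomy and synthesis to guide future research on and applications of hybrid vision transformers.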
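
As a small illustration of one of the design features named above, the sketch below builds a fixed sinusoidal positional embedding table of the kind introduced with the original Transformer. This is an assumption made for illustration: the summary does not say which positional embedding any particular model uses, and many vision transformers use learned or relative embeddings instead. The function name `sinusoidal_positions` is hypothetical.

```python
import math

def sinusoidal_positions(num_positions, dim):
    """Return a num_positions x dim table of fixed sinusoidal embeddings.

    Even channels hold sin(pos / 10000^(i / dim)) and the following odd
    channel holds the matching cosine, where i is the even channel index.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

if __name__ == "__main__":
    # One embedding row per patch position; it would be added to patch tokens.
    for row in sinusoidal_positions(4, 6):
        print([round(v, 3) for v in row])
```

In a patch-based model, each row would typically be added element-wise to the patch token at the same position before the first attention layer, so that attention can distinguish otherwise identical patches by location.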

This review was generated by AI and reviewed by human editors.