QUICK REVIEW

[Paper Review] RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen, Rameswar Panda|arXiv (Cornell University)|Jun 4, 2021

Advanced Neural Network Applications | References: 53 | Cited by: 93

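One-Sentence Summary

RegionViT is a pyramid-structured vision transformer built on regional-to-local attention: regional self-attention among all regional tokens is followed by local self-attention within each region, so global information reaches every local region.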

ABSTRACT

Vision transformer (ViT) has recently shown its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture from natural language processing directly, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens, and then the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines the scope to a local region, it can still receive global information. Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works. Our source codes and models are available at https://github.com/ibm/regionvit.

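Motivation and Goals

Improve vision transformers by tailoring the architecture to vision tasks instead of inheriting the NLP design unchanged.
Propose a pyramid-based regional-to-local attention mechanism that aggregates global information across regions and models detailed interactions within each region.
Associate each regional token with a set of local tokens so the model captures both global and local context.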

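Proposed Method

Generate regional tokens and local tokens from the image using different patch sizes, and associate each regional token with a set of local tokens by spatial location.
Apply regional self-attention across all regional tokens to capture global information.
Apply local self-attention between each regional token and its associated local tokens to refine local detail.
Combine regional and local attention in a pyramid transformer so that global information propagates into local regions.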
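
The two attention steps can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It assumes single-head attention, random weights, plain residual additions, and illustrative patch sizes (4 for local tokens, 16 for regional tokens on a 32x32 image). Layer normalization, feed-forward layers, and the pyramid stages are omitted, and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token axis (-2)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def patch_tokens(image, patch, w_embed):
    """Cut an (H, W, C) image into patch x patch tiles and linearly embed each tile."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tiles = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(gh, gw, patch * patch * c)
    return tiles @ w_embed  # (gh, gw, dim)


def group_by_region(local_grid, ratio):
    """Reshape a (gh, gw, D) local-token grid into (R, ratio * ratio, D) groups."""
    gh, gw, d = local_grid.shape
    g = local_grid.reshape(gh // ratio, ratio, gw // ratio, ratio, d)
    return g.transpose(0, 2, 1, 3, 4).reshape(-1, ratio * ratio, d)


def regional_to_local_attention(regional, local, p_regional, p_local):
    """regional: (R, D); local: (R, M, D), where local[i] lies inside region i."""
    # Step 1: regional self-attention exchanges information among all regions.
    regional = regional + self_attention(regional, *p_regional)
    # Step 2: local self-attention over [regional token; its local tokens],
    # computed independently for every region (batched over the first axis).
    groups = np.concatenate([regional[:, None, :], local], axis=1)  # (R, 1+M, D)
    groups = groups + self_attention(groups, *p_local)
    return groups[:, 0], groups[:, 1:]


# Illustrative setup: every size below is an assumption made for this sketch.
rng = np.random.default_rng(0)
dim, local_patch, regional_patch = 16, 4, 16
image = rng.standard_normal((32, 32, 3))


def rand_w(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)


local_grid = patch_tokens(image, local_patch, rand_w(local_patch**2 * 3, dim))
regional_grid = patch_tokens(image, regional_patch, rand_w(regional_patch**2 * 3, dim))

local = group_by_region(local_grid, regional_patch // local_patch)  # (4, 16, 16)
regional = regional_grid.reshape(-1, dim)                           # (4, 16)

p_regional = [rand_w(dim, dim) for _ in range(3)]  # W_q, W_k, W_v for step 1
p_local = [rand_w(dim, dim) for _ in range(3)]     # W_q, W_k, W_v for step 2

regional_out, local_out = regional_to_local_attention(
    regional, local, p_regional, p_local
)
print(regional_out.shape, local_out.shape)
```

In this sketch, local tokens attend only within their own region, so the only route for information from another region is the regional token that was updated in step 1. Within a single layer, changing a local token in one region does not affect local outputs in another region; in a stacked model that information would first be absorbed by the regional token and then shared in the next layer's regional self-attention.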

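Experimental Results

Research Questions

RQ1: Does regional-to-local attention in a pyramid vision transformer outperform global self-attention variants on standard vision tasks?
RQ2: How does coupling global context among regions with local interactions within each region affect performance on classification, detection, segmentation, and action recognition?
RQ3: Does the RegionViT framework achieve effective information exchange from regional tokens, which carry global context, to local tokens?
RQ4: How does generating regional and local tokens with different patch sizes affect downstream tasks?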

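Key Findings

RegionViT outperforms or is on par with state-of-the-art ViT variants, including many concurrent works, on four vision tasks.
The two-step regional-to-local attention lets global information reach local regions even though local self-attention is confined to a single region.
The pyramid structure with regional tokens and their associated local tokens delivers competitive performance on image classification, object and keypoint detection, semantic segmentation, and action recognition.
The approach provides a flexible mechanism for integrating global and local context in vision transformers.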


This review was generated by AI and checked by a human editor.