QUICK REVIEW

[Paper Review] RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen, Rameswar Panda | arXiv (Cornell University) | Jun 4, 2021

Advanced Neural Network Applications | 53 references | 93 citations

TL;DR

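RegionViT is a pyramid-structured vision transformer that replaces global self-attention with regional-to-local attention: regional self-attention first exchanges information among all regional tokens, then local self-attention passes it between each regional token and its associated local tokens, so global information reaches every local region.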

ABSTRACT

Vision transformer (ViT) has recently shown strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens, and then the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation, and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models are available at https://github.com/ibm/regionvit.

Motivation & Objective

Improve Vision Transformers by designing an architecture optimized for vision tasks rather than importing NLP-style designs directly.
Propose a pyramid-based regional-to-local attention mechanism that combines global information from regional tokens with detailed local interactions.
Enable regional tokens to be associated with local tokens to capture both global and local contextual cues.
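
The association between regional and local tokens can be illustrated with a small sketch. It assumes square images and patch sizes that tile the image evenly; the function name associate_tokens and the sizes used (a 224-pixel image, local patches of 4 pixels, regional patches of 28 pixels) are illustrative assumptions, not values taken from this review.

```python
def associate_tokens(image_size, local_patch, regional_patch):
    """Map each regional token to the indices of the local tokens it covers.

    Hypothetical helper: tokens are indexed row-major on their own grids.
    """
    if regional_patch % local_patch or image_size % regional_patch:
        raise ValueError("patch sizes must tile the image evenly")
    local_grid = image_size // local_patch        # local tokens per side
    regional_grid = image_size // regional_patch  # regional tokens per side
    window = regional_patch // local_patch        # local tokens per region side

    groups = {}
    for ry in range(regional_grid):
        for rx in range(regional_grid):
            groups[ry * regional_grid + rx] = [
                (ry * window + dy) * local_grid + (rx * window + dx)
                for dy in range(window)
                for dx in range(window)
            ]
    return groups


# Assumed sizes, chosen only for illustration.
groups = associate_tokens(image_size=224, local_patch=4, regional_patch=28)
print(len(groups))     # 64 regional tokens (8 x 8 grid)
print(len(groups[0]))  # 49 local tokens per regional token (7 x 7 window)
```

Under these assumed sizes, every local token belongs to exactly one regional token, which is what lets local self-attention run independently per region.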

Proposed method

Generate regional tokens and local tokens from the image using different patch sizes, and associate each regional token with a set of local tokens based on spatial location.
Compute regional self-attention across all regional tokens to capture global information.
Perform local self-attention between each regional token and its associated local tokens to refine local details.
Integrate regional and local attention within a pyramid Transformer framework to propagate global information to local regions.
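
The two attention steps above can be sketched in a few lines of plain Python. This is a toy illustration under strong simplifying assumptions, not the authors' implementation: it uses a single head with identity query/key/value projections and omits the residual connections, normalization, and feed-forward layers a real transformer block would have. All function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, tokens))
            for j in range(dim)
        ])
    return outputs


def regional_to_local_attention(regional, local_groups):
    """regional[i] is one regional token; local_groups[i] holds its local tokens."""
    # Step 1: regional self-attention mixes information across all regions.
    regional = self_attention(regional)
    new_regional, new_local_groups = [], []
    # Step 2: local self-attention over each regional token plus its local tokens.
    for region_token, local_tokens in zip(regional, local_groups):
        mixed = self_attention([region_token] + local_tokens)
        new_regional.append(mixed[0])
        new_local_groups.append(mixed[1:])
    return new_regional, new_local_groups


# Two regions, two local tokens each, 2-dimensional features (toy values).
local_groups = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.2, 0.8]]]
_, base = regional_to_local_attention([[1.0, 0.0], [0.0, 1.0]], local_groups)
# Change only the regional token of region 1 and rerun.
_, changed = regional_to_local_attention([[1.0, 0.0], [0.0, 3.0]], local_groups)

# Local tokens in region 0 change too: information crossed region boundaries.
print(base[0][0] != changed[0][0])  # True
```

The final check shows the point of the design: although step 2 only attends within a region, a change to another region's regional token still alters the local outputs, because step 1 has already mixed the regional tokens.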

Experimental results

Research questions

RQ1: Can regional-to-local attention in a pyramid Vision Transformer outperform global self-attention variants on standard vision tasks?
RQ2: How does coupling global context from regional tokens with local-region interactions affect performance in classification, detection, segmentation, and action recognition?
RQ3: Does the RegionViT framework enable effective information exchange from regional tokens carrying global context to localized token interactions?
RQ4: What is the impact of generating regional and local tokens with different patch sizes on downstream tasks?

Key findings

RegionViT outperforms or matches state-of-the-art ViT variants, including many concurrent works, across four vision tasks.
The two-step regional-to-local attention lets global information reach local regions even though local self-attention is confined to a single region.
The pyramid structure with regional tokens and associated local tokens provides competitive performance on classification, object/keypoint detection, semantic segmentation, and action recognition.
The approach provides a flexible mechanism to integrate global and local contextual cues within Vision Transformers.

This review was created by AI and reviewed by human editors.