QUICK REVIEW

[Paper Review] MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu, Hossein Talebi | arXiv (Cornell University) | 2022-04-04

Visual Attention and Saliency Detection | Citations: 26

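One-Line Summary

MaxViT introduces multi-axis self-attention (Max-SA), which mixes blocked local attention with dilated global attention, and combines it with convolutions to form a scalable hierarchical vision backbone that achieves state-of-the-art performance on ImageNet and COCO.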

ABSTRACT

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.

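Motivation and Objectives

Motivate a scalable vision architecture that captures both local and global interactions.
Develop a Transformer block that combines block (local) attention and grid (global) attention with convolutions.
Build a simple hierarchical backbone by repeating the MaxViT block across stages.
Demonstrate strong performance on classification, detection, aesthetic assessment, and generation tasks.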
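
To illustrate what "hierarchical" means here, the following minimal sketch tracks the feature-map side length through a stem and four stages. It assumes that the stem and every stage each downsample by a factor of 2 and that the input is 224x224; the function name `stage_resolutions` and its parameters are hypothetical and are not taken from the review or the released code.

```python
def stage_resolutions(input_size=224, num_stages=4, stride=2):
    """Return (stage name, feature-map side length) pairs.

    Assumption: the stem (S0) and each later stage halve the resolution once.
    """
    names = ["S0 (stem)"] + [f"S{k}" for k in range(1, num_stages + 1)]
    sizes = []
    size = input_size
    for name in names:
        size //= stride  # one stride-2 downsampling step per stage
        sizes.append((name, size))
    return sizes


for name, size in stage_resolutions():
    print(f"{name}: {size}x{size}")
# S0 (stem): 112x112
# S1: 56x56
# S2: 28x28
# S3: 14x14
# S4: 7x7
```

Under these assumptions the early stages operate on large feature maps (56x56 and 28x28), which is where attention with linear rather than quadratic cost matters most.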

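Proposed Method

Decompose full attention into block (local) attention and grid (global) attention, yielding multi-axis self-attention (Max-SA) with linear complexity.
Combine Max-SA with MBConv blocks that include squeeze-and-excitation (SE) to improve generalization; the MBConv also serves as a conditional position encoding for Max-SA.
Stack the repeated MaxViT block over a convolutional stem (S0) followed by four stages (S1-S4) to form the hierarchical MaxViT backbone.
Provide variants (MaxViT-T, -S, -B, -L, -XL) that differ in the number of blocks and channel width per stage.
Show that Max-SA works as a drop-in replacement for Swin-style attention, keeping the same parameter count and FLOPs while enabling global interaction at every stage.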
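
The difference between the two partitions can be made concrete with a minimal sketch in plain Python. It only groups pixel coordinates and is not the authors' implementation; the function names `block_groups` and `grid_groups`, and the toy 4x4 map with window and grid size 2, are assumptions chosen for illustration. Attention would then be computed independently inside each group.

```python
def block_groups(height, width, window):
    """Block (local) partition: non-overlapping window x window tiles."""
    groups = {}
    for i in range(height):
        for j in range(width):
            key = (i // window, j // window)  # which tile the pixel falls in
            groups.setdefault(key, []).append((i, j))
    return groups


def grid_groups(height, width, grid):
    """Grid (global) partition: grid x grid pixels sampled with a fixed stride."""
    stride_h, stride_w = height // grid, width // grid
    groups = {}
    for i in range(height):
        for j in range(width):
            key = (i % stride_h, j % stride_w)  # pixels sharing an offset form a group
            groups.setdefault(key, []).append((i, j))
    return groups


print(block_groups(4, 4, window=2)[(0, 0)])
# [(0, 0), (0, 1), (1, 0), (1, 1)]  -> neighbouring pixels (local)
print(grid_groups(4, 4, grid=2)[(0, 0)])
# [(0, 0), (0, 2), (2, 0), (2, 2)]  -> pixels spread over the whole map (dilated, global)
```

Both partitions produce groups of a fixed size, so the per-group attention cost does not grow with the image; only the number of groups does.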

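Experimental Results

Research Questions

RQ1 Can multi-axis attention (local block + global grid) provide global context with linear complexity in high-resolution vision tasks?
RQ2 Does combining Max-SA with convolutions in a simple hierarchical backbone improve accuracy and efficiency over existing vision Transformers and hybrid models?
RQ3 How do architectural choices such as block order, sequential versus parallel attention, inclusion of MBConv, and vertical layout affect performance on vision tasks?
RQ4 How well does MaxViT scale with data (ImageNet-1K, ImageNet-21K, JFT-300M) and transfer to downstream tasks such as detection and aesthetic assessment?
RQ5 Can MaxViT deliver strong generative performance in an image generation setting?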
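
The linear-complexity claim behind RQ1 can be checked with simple arithmetic. The sketch below counts query-key pairs only, ignoring the channel dimension and projection costs, and assumes a fixed window size and grid size of 7; the function names are hypothetical. Full attention over N = H x W tokens needs N^2 pairs, whereas block attention needs N x P^2 and grid attention N x G^2.

```python
def full_attention_pairs(height, width):
    """Query-key pairs for global self-attention over all tokens."""
    n = height * width
    return n * n


def multi_axis_pairs(height, width, window=7, grid=7):
    """Query-key pairs for block attention plus grid attention.

    Assumption: each token attends to window*window tokens in its block
    and to grid*grid tokens in its grid group.
    """
    n = height * width
    return n * window * window + n * grid * grid


for side in (56, 112):
    print(side, full_attention_pairs(side, side), multi_axis_pairs(side, side))
# 56 9834496 307328
# 112 157351936 1229312
```

Doubling the side length quadruples the token count: the multi-axis count grows by 4x (linear in the number of tokens), while the full-attention count grows by 16x (quadratic).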

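Key Results

MaxViT achieves state-of-the-art ImageNet-1K top-1 accuracy across settings, including 86.7% for MaxViT-L with 512x512 fine-tuning; even at the base 224x224 resolution, MaxViT-L reaches 85.17%.
With ImageNet-21K pre-training, MaxViT-B reaches 88.38% top-1 accuracy and MaxViT-XL reaches 88.70% at 512x512, outperforming models of similar or larger size.
With JFT-300M-scale data, MaxViT-XL achieves 89.53% top-1 accuracy, showing that the model scales well to large datasets.
On COCO object detection and instance segmentation, MaxViT backbones outperform Swin, ConvNeXt, and UViT across model sizes, with an especially large margin at the base scale (for example, MaxViT-S outperforms Swin-B and UViT-B at comparable FLOPs).
On image aesthetic assessment (AVA), MaxViT-T achieves competitive PLCC/SRCC scores and improves over prior methods as the input resolution increases.
On unconditional 128x128 image generation, MaxViT achieves better FID/IS than HiT and other baselines despite using fewer parameters.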


This review was generated by AI and reviewed by a human editor.