QUICK REVIEW

[Paper Review] AS-MLP: An Axial Shifted MLP Architecture for Vision

Dongze Lian, Zehao Yu|arXiv (Cornell University)|2021. 07. 18.

Advanced Neural Network Applications|References 47|Citations 85

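One-Line Summary

AS-MLP introduces axial channel shifting into an MLP framework to capture local dependencies, achieves competitive ImageNet accuracy, and extends to downstream tasks such as object detection and semantic segmentation.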

ABSTRACT

An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper. Different from MLP-Mixer, where the global spatial feature is encoded for information flow through matrix transposition and one token-mixing MLP, we pay more attention to the local features interaction. By axially shifting channels of the feature map, AS-MLP is able to obtain the information flow from different axial directions, which captures the local dependencies. Such an operation enables us to utilize a pure MLP architecture to achieve the same local receptive field as CNN-like architecture. We can also design the receptive field size and dilation of blocks of AS-MLP, etc, in the same spirit of convolutional neural networks. With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset. Such a simple yet effective architecture outperforms all MLP-based architectures and achieves competitive performance compared to the transformer-based architectures (e.g., Swin Transformer) even with slightly lower FLOPs. In addition, AS-MLP is also the first MLP-based architecture to be applied to the downstream tasks (e.g., object detection and semantic segmentation). The experimental results are also impressive. Our proposed AS-MLP obtains 51.5 mAP on the COCO validation set and 49.5 MS mIoU on the ADE20K dataset, which is competitive compared to the transformer-based architectures. Our AS-MLP establishes a strong baseline of MLP-based architecture. Code is available at https://github.com/svip-lab/AS-MLP.

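Motivation and Objectives

Motivates the need for MLP-based vision models to exploit local feature interactions rather than relying only on global token mixing.
Proposes a lightweight axial shift mechanism that enables local receptive fields within a pure MLP architecture.
Designs a scalable four-stage AS-MLP backbone with hierarchical feature merging.
Demonstrates competitive ImageNet-1K accuracy and competitive transfer to downstream tasks (COCO detection, ADE20K segmentation).
Provides ablations to understand the effect of shift configuration, padding, dilation, and connection type.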

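Proposed Method

Introduces the Axial Shifted MLP (AS-MLP) block, which shifts features horizontally and vertically and then applies a channel projection, enabling local feature aggregation.
Combines the shifted features using Norm layers, residual connections, and MLP-based channel mixing.
The shift operation gathers information from different spatial positions without relying on full attention, keeping complexity low.
Adopts a Swin-like four-stage backbone with patch partition and patch merging to build hierarchical representations.
Ablates shift size, padding method, dilation rate, and serial versus parallel connection to identify an effective configuration.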
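
The axial shift described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`shift_row`, `axial_shift`, `project`, `axial_shift_mixing`) are hypothetical, and the choices of zero padding, an odd shift size, contiguous channel groups, and a parallel sum of the horizontal and vertical branches are assumptions made for illustration (the review only says these options were ablated). Norm layers and residual connections are omitted for brevity.

```python
def shift_row(row, offset):
    """Shift a 1-D list by `offset` positions, zero-filling vacated slots."""
    n = len(row)
    out = [0] * n
    for i, value in enumerate(row):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out


def axial_shift(x, shift_size=3, horizontal=True):
    """Shift channel groups of x[c][h][w] along one spatial axis.

    Channels are split into contiguous groups; group k is shifted by
    k - shift_size // 2 positions (offsets -1, 0, +1 for shift_size=3).
    Zero padding is an assumption of this sketch.
    """
    if shift_size % 2 != 1:
        raise ValueError("this sketch assumes an odd shift_size")
    pad = shift_size // 2
    group = -(-len(x) // shift_size)  # ceil division: channels per group
    out = []
    for c, fmap in enumerate(x):
        offset = c // group - pad
        if horizontal:
            out.append([shift_row(row, offset) for row in fmap])
        else:
            # Transpose, shift each column, then transpose back.
            cols = [shift_row(list(col), offset) for col in zip(*fmap)]
            out.append([list(row) for row in zip(*cols)])
    return out


def project(x, weight):
    """Channel projection: out[o][h][w] = sum_c weight[o][c] * x[c][h][w]."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w_c * x[c][i][j] for c, w_c in enumerate(w_row))
              for j in range(width)]
             for i in range(height)]
            for w_row in weight]


def axial_shift_mixing(x, w_h, w_v):
    """Sum of horizontally and vertically shifted branches (parallel form assumed)."""
    h = project(axial_shift(x, horizontal=True), w_h)
    v = project(axial_shift(x, horizontal=False), w_v)
    return [[[a + b for a, b in zip(row_h, row_v)]
             for row_h, row_v in zip(map_h, map_v)]
            for map_h, map_v in zip(h, v)]


# Three channels, one row, four columns.
x = [[[1, 2, 3, 4]], [[5, 6, 7, 8]], [[9, 10, 11, 12]]]
print(axial_shift(x))  # [[[2, 3, 4, 0]], [[5, 6, 7, 8]], [[0, 9, 10, 11]]]
```

After the shift, each spatial position holds channels that originated at neighbouring positions, so the per-position channel projection mixes information from a small local window. That is the sense in which a pure MLP block obtains a local receptive field.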

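Experimental Results

Research Questions

RQ1: Can axial (horizontal and vertical) feature shifts in an MLP-only backbone achieve a local receptive field competitive with CNNs or window-based transformers?
RQ2: Which shift size, padding strategy, and connectivity (serial versus parallel) maximize accuracy while preserving efficiency?
RQ3: How well does AS-MLP transfer to downstream tasks such as object detection and semantic segmentation compared with transformer-based backbones?
RQ4: What are the trade-offs among model size, FLOPs, and accuracy across AS-MLP variants on ImageNet-1K?
RQ5: Can AS-MLP deliver better mobile-friendly performance than Swin Transformer under similar resource constraints?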

Key Results

Model	Input	Resolution	Top-1 (%)	Params	FLOPs	Throughput (images/s)
AS-MLP-T	224	224x224	81.3	28M	4.4G	1047.7
AS-MLP-S	224	224x224	83.1	50M	8.5G	619.5
AS-MLP-B	224	224x224	83.3	88M	15.2G	455.2
AS-MLP-B	384	384x384	84.3	88M	44.6G	179.2

AS-MLP-B reaches 83.3% Top-1 accuracy on ImageNet-1K at 224x224 with 88M parameters and 15.2 GFLOPs.
At 384x384, AS-MLP-B reaches 84.3% Top-1 with 88M parameters and 44.6 GFLOPs.
AS-MLP-S reaches 83.1% Top-1 with 50M parameters and 8.5 GFLOPs.
AS-MLP-T reaches 81.3% Top-1 with 28M parameters and 4.4 GFLOPs.
In the mobile setting, AS-MLP (mobile) outperforms Swin (mobile) in Top-1 accuracy (76.05% vs. 75.11%).
AS-MLP is competitive with transformer-based counterparts on COCO object detection (e.g., AS-MLP-B at 51.5 box AP, APb) and ADE20K segmentation (AS-MLP-B at 49.5 MS mIoU).
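
As a rough sanity check on the accuracy/compute trade-off (RQ4), the ImageNet-1K rows from the results table can be compared directly. The short sketch below copies those rows and computes two things: the extra Top-1 accuracy gained per step up in model size, and an estimate of the 384x384 FLOPs from the 224x224 figure. The estimate assumes that compute grows in proportion to the number of input pixels; that scaling rule is an assumption of this sketch, not a claim from the review. The helper names are hypothetical.

```python
# Rows copied from the ImageNet-1K results table:
# (model, resolution, top1_percent, params_millions, gflops)
RESULTS = [
    ("AS-MLP-T", 224, 81.3, 28, 4.4),
    ("AS-MLP-S", 224, 83.1, 50, 8.5),
    ("AS-MLP-B", 224, 83.3, 88, 15.2),
    ("AS-MLP-B", 384, 84.3, 88, 44.6),
]


def scaled_gflops(gflops, res_from, res_to):
    """Estimate FLOPs at a new square resolution.

    Assumes cost is proportional to pixel count (an assumption of this sketch).
    """
    return gflops * (res_to / res_from) ** 2


def marginal_gains(rows):
    """Top-1 gain and extra GFLOPs between consecutive rows."""
    gains = []
    for prev, cur in zip(rows, rows[1:]):
        gains.append((prev[0], cur[0],
                      round(cur[2] - prev[2], 1),
                      round(cur[4] - prev[4], 1)))
    return gains


# Compare the three 224x224 variants only.
for small, large, d_top1, d_gflops in marginal_gains(RESULTS[:3]):
    print(f"{small} -> {large}: +{d_top1} Top-1 for +{d_gflops} GFLOPs")

estimate = scaled_gflops(15.2, 224, 384)
print(f"estimated {estimate:.1f} GFLOPs at 384x384 (table reports 44.6)")
```

Under these numbers, moving from AS-MLP-T to AS-MLP-S buys 1.8 points for 4.1 extra GFLOPs, while moving from AS-MLP-S to AS-MLP-B buys only 0.2 points for 6.7 extra GFLOPs. The pixel-proportional estimate for AS-MLP-B at 384x384 comes out at about 44.7 GFLOPs, close to the reported 44.6.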

This review was generated by AI and reviewed by a human editor.