QUICK REVIEW

[Paper Review] S$^2$-MLP: Spatial-Shift MLP Architecture for Vision

Tan Yu, Li Xu|arXiv (Cornell University)|2021. 06. 14.

Advanced Neural Network Applications | References: 39 | Citations: 29

One-Line Summary

S2-MLP is a pure MLP architecture built on a parameter-free spatial-shift operation. It achieves competitive accuracy on ImageNet-1K with fewer FLOPs and parameters than ViT and MLP-Mixer.

ABSTRACT

Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our S$^2$-MLP only contains channel-mixing MLP. We utilize a spatial-shift operation for communications between patches. It has a local reception field and is spatial-agnostic. It is parameter-free and efficient for computation. The proposed S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S$^2$-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.

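Research Motivation and Goals

Motivate the need for alternatives to convolution and self-attention in vision backbones, particularly when training on medium-scale data.
Propose a pure MLP architecture (S2-MLP) with a parameter-free spatial-shift block for communication between patches.
Show that spatial-shift-based communication delivers competitive accuracy with fewer parameters and FLOPs than ViT and MLP-Mixer.
Evaluate S2-MLP on ImageNet-1K and run ablation studies to understand how depth, width, shift directions, and input scale affect performance.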

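Proposed Method

Embed each image patch with a patch-wise fully connected layer.
Stack N S2-MLP blocks, each containing four fully connected layers, two GELU activations, and two layer normalizations.
Replace token mixing with a spatial-shift module that splits the channels into groups and shifts each group in one direction, enabling communication between neighboring patches.
Define the spatial shift as a parameter-free operation that corresponds to a depthwise (DW) convolution-like shift with fixed weights between neighboring patches.
Provide a complexity analysis of the patch-wise fully connected layer (PFL), the S2-MLP blocks, and the final classification layer.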
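The spatial-shift step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `spatial_shift`, the (height, width, channels) layout, the order in which the four groups are assigned to directions, and the choice to leave border patches unchanged are all assumptions made here for the example.

```python
import numpy as np

def spatial_shift(x: np.ndarray) -> np.ndarray:
    """Shift four channel groups of an (h, w, c) patch grid by one patch each.

    Assumption of this sketch: c is divisible by 4, and patches on the border
    that have no neighbor to copy from keep their original values.
    """
    h, w, c = x.shape
    if c % 4 != 0:
        raise ValueError("channel count must be divisible by 4 in this sketch")
    g = c // 4
    out = x.copy()  # read from x, write to out, so overlapping slices are safe
    out[1:, :, :g] = x[:-1, :, :g]                    # group 0: shift down
    out[:-1, :, g:2 * g] = x[1:, :, g:2 * g]          # group 1: shift up
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]  # group 2: shift right
    out[:, :-1, 3 * g:] = x[:, 1:, 3 * g:]            # group 3: shift left
    return out

# Tiny usage example: a 3x3 grid of patches with 4 channels (one per group).
x = np.arange(3 * 3 * 4, dtype=np.float32).reshape(3, 3, 4)
y = spatial_shift(x)
```

The function contains no learnable weights, which is what "parameter-free" means here. After the shift, each patch holds channels copied from its four neighbors, so the channel-mixing layers that follow can combine information across adjacent patches.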

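Experimental Results

Research Questions

RQ1: Can a pure MLP architecture without token mixing be competitive in ImageNet-1K accuracy when trained on medium-scale data?
RQ2: Can a parameter-free spatial-shift mechanism provide enough cross-patch communication to match MLP-Mixer or approach ViT performance?
RQ3: How do depth (N), hidden size (c), expansion ratio (r), shift directions, and input scale affect accuracy and efficiency?
RQ4: Is the S2-MLP architecture invariant to input scale, and how does this compare with token-mixing MLP models?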

Key Results

Model	Resolution	Top-1 (%)	Top-5 (%)	Parameters (M)	FLOPs (B)
S2-MLP-wide	224×224	80.0	94.8	71	14.0
S2-MLP-deep	224×224	80.7	95.4	51	10.5
Mixer-B/16	224×224	76.4	-	59	11.6
FF	224×224	74.9	-	59	11.6
ResMLP-36	224×224	79.7	-	45	8.9
ViT-B/16	384×384	77.9	-	86	55.5

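S2-MLP-wide achieves 80.0% Top-1 and 94.8% Top-5 accuracy on ImageNet-1K with 71M parameters and 14B FLOPs, outperforming MLP-Mixer (76.4% Top-1) at a comparable scale.
S2-MLP-deep achieves 80.7% Top-1 and 95.4% Top-5 accuracy on ImageNet-1K with 51M parameters and 10.5B FLOPs, surpassing ResMLP-36 under similar conditions.
S2-MLP remains competitive with ViT in the reported configurations while keeping a simpler architecture and lower FLOPs and parameter counts.
Increasing depth from 1 to 12 blocks raises accuracy (from 56.7% to 87.1% Top-1 on ImageNet100), but beyond 12 to 16 blocks performance saturates or drops slightly because of overfitting on the smaller dataset.
Widening the hidden size c tends to raise accuracy up to a point (87.1% Top-1 at c=768), while a larger c also increases parameters and FLOPs.
Shifting in four directions (the default) provides strong cross-patch communication. Adding directions improves performance up to a point, whereas removing the shift altogether causes a large drop in accuracy.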
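To make the cost of widening c concrete, the following back-of-the-envelope sketch counts the weights of the fully connected layers in a stack of blocks. It rests on assumptions that the review does not state: the four layers per block are taken to be c→c, c→c, c→rc, and rc→c, and biases, layer normalization, the patch embedding, and the classifier are ignored. The function names and the example configuration (N=12, c=768, r=4) are hypothetical choices for illustration.

```python
def s2mlp_block_weights(c: int, r: int) -> int:
    """Weights in one block under the assumed layout: two c*c layers
    plus a c*(r*c) expansion and an (r*c)*c projection."""
    return 2 * c * c + 2 * r * c * c

def s2mlp_stack_weights(n_blocks: int, c: int, r: int) -> int:
    """Weights in n_blocks identical blocks (embedding and classifier excluded)."""
    return n_blocks * s2mlp_block_weights(c, r)

# Hypothetical configuration, chosen only for illustration.
total = s2mlp_stack_weights(n_blocks=12, c=768, r=4)
print(f"{total / 1e6:.1f}M")  # prints 70.8M

# The count is quadratic in c: doubling the hidden size quadruples it.
ratio = s2mlp_block_weights(2 * 768, 4) / s2mlp_block_weights(768, 4)
```

Under these assumptions the total lands near the 71M listed for S2-MLP-wide in the table, although the review does not say which configuration that model uses. The quadratic dependence on c explains why a wider model gets expensive quickly, while depth N only scales the count linearly.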

This review was generated by AI and reviewed by a human editor.