QUICK REVIEW

[Paper Review] ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang, Yu-Bin Yang | arXiv (Cornell University) | 2021-05-28

Advanced Neural Network Applications | References: 36 | Citations: 148

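One-Line Summary

ResT introduces Efficient Multi-head Self-Attention (EMSA), a memory-efficient multi-scale Vision Transformer backbone, flexible spatial position encoding, and overlapping patch embedding, and achieves strong results on ImageNet and COCO.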

ABSTRACT

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably served as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT have several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can tackle with input images of arbitrary size without interpolation or fine-tune; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operation with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform the recently state-of-the-art backbones by a large margin, demonstrating the potential of ResT as strong backbones. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.

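Motivation and Objectives

Develop a general-purpose backbone architecture for image recognition that combines the locality of CNNs with the global reasoning of Transformers.
Reduce the memory and computational cost of self-attention while preserving multi-head diversity.
Support flexible input sizes and multi-scale feature maps for dense prediction tasks.
Validate ResT on ImageNet-1k classification and on downstream tasks such as object detection and instance segmentation.
Demonstrate that ResT outperforms comparable backbones at similar model sizes.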

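Proposed Method

Introduces Efficient Multi-head Self-Attention (EMSA), which compresses the spatial tokens with a depth-wise convolution and projects the interaction across attention heads.
Replaces fixed patch tokenization with overlapping convolution-based patch embedding to build a multi-scale feature pyramid.
Defines position encoding as spatial attention (PA), so that inputs of varying size are handled without interpolation or fine-tuning.
Adds a 1×1 convolution followed by Instance Normalization inside EMSA to restore head diversity and stabilize training.
Uses stage-wise patch embedding to progressively expand the channel dimension and reduce spatial resolution, forming the multi-stage ResT backbone.
Adopts pre-normalization in downstream frameworks and uses a simple global-average-pooling classifier for ImageNet-1k evaluation.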
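
To make the memory argument concrete, the following is a minimal sketch that traces token-map sizes through a four-stage pyramid and compares the per-head attention-map size with and without key/value compression. It is not the authors' implementation: the names `conv_out`, `rest_like_pyramid`, and `StageShape`, the stem stride of 4, the 3×3 stride-2 padding-1 patch embedding, and the per-stage compression strides (8, 4, 2, 1) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageShape:
    stage: int
    height: int
    width: int
    tokens: int             # n = height * width
    full_attn_entries: int  # n * n per head, uncompressed self-attention
    emsa_attn_entries: int  # n * n' per head, keys/values compressed to n' tokens


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def rest_like_pyramid(height, width, stem_stride=4, reductions=(8, 4, 2, 1)):
    """Trace token-map sizes through a four-stage pyramid.

    Assumptions (illustrative only, not taken from the review text):
    - the stem reduces each side by `stem_stride` (ceil division);
    - every later stage starts with a 3x3, stride-2, padding-1
      overlapping convolution, which halves each side;
    - `reductions[i]` is the stride used to compress keys/values in stage i+1.
    """
    h = -(-height // stem_stride)  # ceil division
    w = -(-width // stem_stride)
    shapes = []
    for stage, s in enumerate(reductions, start=1):
        if stage > 1:
            h = conv_out(h, kernel=3, stride=2, padding=1)
            w = conv_out(w, kernel=3, stride=2, padding=1)
        n = h * w
        # Size of the compressed key/value map (ceil division per side).
        kv_tokens = (-(-h // s)) * (-(-w // s))
        shapes.append(StageShape(stage, h, w, n, n * n, n * kv_tokens))
    return shapes


if __name__ == "__main__":
    for shape in rest_like_pyramid(224, 224):
        ratio = shape.full_attn_entries / shape.emsa_attn_entries
        print(f"stage {shape.stage}: {shape.height}x{shape.width} tokens={shape.tokens} "
              f"attention entries {shape.full_attn_entries} -> {shape.emsa_attn_entries} "
              f"({ratio:.0f}x smaller)")
```

Under these assumptions, a 224×224 input yields 56×56 = 3136 tokens in the first stage, and compressing keys and values by a stride of 8 shrinks each head's attention map from 3136 × 3136 entries to 3136 × 49. Because nothing in the sketch depends on a fixed resolution, the same function can be called with other input sizes, which mirrors the flexible-input goal stated above.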

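Experimental Results

Research Questions

RQ1: How can self-attention in a Vision Transformer backbone be made memory-efficient without degrading performance?
RQ2: Can spatially conditioned position encoding enable flexible input sizes and multi-scale representations for dense prediction?
RQ3: Does overlapping patch embedding improve low-level feature capture and overall accuracy compared with standard tokenization?
RQ4: How much does the ResT backbone improve over backbones of similar cost on ImageNet-1k and on COCO object detection and instance segmentation?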

Key Results

Model	#Params (M)	FLOPs (G)	Throughput (images/s)	Top-1 (%)	Top-5 (%)
ResT-Lite	10.49	1.4	1246	77.2 (↑7.5)	93.7 (↑4.6)
ResT-Small	13.66	1.9	1043	79.6 (↑9.9)	94.9 (↑5.8)
ResT-Base	30.28	4.3	673	81.6 (↑2.6)	95.7 (↑1.3)
ResT-Large	51.63	7.9	429	83.6 (↑3.3)	96.3 (↑1.1)
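
The table does not name the reference models behind the (↑) annotations. As a small sketch, assuming each arrow denotes an absolute gain in percentage points over some reference model, the implied reference accuracy can be recovered by subtraction. The dictionary `RESULTS` and the helper `implied_reference` are hypothetical names introduced for this example; the numbers are copied from the table above.

```python
# (top1, top1_gain, top5, top5_gain), copied from the table above.
RESULTS = {
    "ResT-Lite": (77.2, 7.5, 93.7, 4.6),
    "ResT-Small": (79.6, 9.9, 94.9, 5.8),
    "ResT-Base": (81.6, 2.6, 95.7, 1.3),
    "ResT-Large": (83.6, 3.3, 96.3, 1.1),
}


def implied_reference(score: float, gain: float) -> float:
    """Reference accuracy implied by 'score (↑gain)'.

    Assumption: the arrow is an absolute gain in percentage points.
    """
    return round(score - gain, 1)


# Map each model to its implied (Top-1, Top-5) reference accuracy.
implied = {
    name: (implied_reference(top1, g1), implied_reference(top5, g5))
    for name, (top1, g1, top5, g5) in RESULTS.items()
}

if __name__ == "__main__":
    for name, (ref1, ref5) in implied.items():
        print(f"{name}: implied reference Top-1 {ref1}%, Top-5 {ref5}%")
```

Under that assumption, ResT-Lite and ResT-Small imply the same reference accuracy (69.7% Top-1, 89.1% Top-5), which suggests they are compared against the same model, while ResT-Base and ResT-Large imply different and stronger references.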

ResT-Small achieves 79.6% Top-1 accuracy on ImageNet-1k with 1.9G FLOPs and 13.66M parameters.
ResT-Large reaches 83.6% Top-1 with 7.9G FLOPs and 51.63M parameters, outperforming Swin variants of similar cost.
On COCO object detection with RetinaNet, ResT-Small improves AP by 3.6 points over PVT-T (40.3 vs. 36.7).
ResT-Base improves AP by 1.6 points over PVT-S (42.0 vs. 40.4).
For instance segmentation with Mask R-CNN, ResT-Large (APbox 41.6, APmask 38.7) holds a clear advantage over PVT-S and Swin variants of similar budget.

This review was generated by AI and reviewed by a human editor.