QUICK REVIEW

[Paper Review] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, Enze Xie|arXiv (Cornell University)|2021. 02. 24.

Advanced Neural Network Applications|Citations: 58

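One-Line Summary

This paper introduces Pyramid Vision Transformer (PVT), a convolution-free Transformer backbone that combines a multi-scale feature pyramid with spatial-reduction attention (SRA). It supports high-resolution dense prediction and achieves competitive performance on detection, segmentation, and classification tasks.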

ABSTRACT

Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT that typically has low-resolution outputs and high computational and memory cost, PVT can be not only trained on dense partitions of the image to achieve high output resolution, which is important for dense predictions but also using a progressive shrinking pyramid to reduce computations of large feature maps. (2) PVT inherits the advantages from both CNN and Transformer, making it a unified backbone in various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future researches. Code is available at https://github.com/whai362/PVT.

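Research Motivation and Objectives

Develop a pure Transformer backbone, free of convolutions, that is suited to dense prediction tasks (detection, segmentation).
Introduce a pyramid-based feature hierarchy that provides multi-scale, high-resolution representations.
Reduce the computation and memory cost of attention at high resolution through spatial-reduction attention (SRA).
Demonstrate the effectiveness of PVT as a replacement backbone across object detection, instance and semantic segmentation, and image classification.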

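Proposed Method

The input is represented as fine-grained patches (4x4), and a four-stage pyramid produces multi-scale feature maps (F1..F4).
Patch embedding at each stage progressively reduces the feature-map resolution (strides of 4x, 8x, 16x, and 32x relative to the input).
Standard multi-head attention is replaced with spatial-reduction attention (SRA), which shrinks K and V before attention to cut computation and memory.
All stages share the same Transformer encoder design, each configured with its own hyperparameters (number of encoder layers L_i, feed-forward expansion ratio E_i, number of attention heads N_i, and SRA reduction ratio R_i) to balance accuracy and efficiency.
Integrating PVT with DETR yields an end-to-end detection pipeline, and pairing it with standard heads supports segmentation and detection tasks.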
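
The stage layout and the saving from SRA can be illustrated with a little arithmetic. The sketch below is not the authors' code: the 224x224 input and the per-stage reduction ratios (8, 4, 2, 1) are assumptions chosen for illustration, and `Stage` and `stage_report` are hypothetical names. It counts query-key pairs only and ignores channel width and the number of heads.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stride: int     # cumulative downsampling relative to the input (4x..32x, from the text)
    reduction: int  # SRA reduction ratio per spatial side (assumed value)


# Assumed configuration: strides come from the review, reduction ratios are illustrative.
STAGES = [Stage(4, 8), Stage(8, 4), Stage(16, 2), Stage(32, 1)]


def stage_report(height, width, stages=STAGES):
    """Return feature-map size and attention pair counts for each stage."""
    rows = []
    for index, stage in enumerate(stages, start=1):
        h, w = height // stage.stride, width // stage.stride
        n_query = h * w
        # SRA shrinks the K/V grid by `reduction` along each side before attention.
        n_kv = (h // stage.reduction) * (w // stage.reduction)
        rows.append({
            "stage": index,
            "feature_map": (h, w),
            "tokens": n_query,
            "kv_tokens": n_kv,
            "full_attention_pairs": n_query * n_query,
            "sra_pairs": n_query * n_kv,
        })
    return rows


if __name__ == "__main__":
    for row in stage_report(224, 224):
        saving = row["full_attention_pairs"] // row["sra_pairs"]
        print(row["stage"], row["feature_map"], row["kv_tokens"], f"{saving}x fewer pairs")
```

Under these assumptions the first stage attends from a 56x56 query grid to a 7x7 key/value grid, so the number of query-key pairs drops by a factor of the reduction ratio squared.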

[Figure, panel (a): CNNs such as VGG [54] and ResNet [22].]

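Experimental Results

Research Questions

RQ1: Can a pure Transformer backbone with a pyramid multi-scale structure replace CNN backbones for dense prediction tasks?
RQ2: How can the attention mechanism be redesigned to process high-resolution feature maps efficiently?
RQ3: What accuracy-efficiency trade-off does a pyramid Transformer offer on dense prediction benchmarks compared with CNNs and ViT?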

Key Results

Method	#Param (M)	GFLOPs	Top-1 error (%)
ResNet18*	11.7	1.8	30.2
ResNet18	11.7	1.8	31.5
DeiT-Tiny/16	5.7	1.3	27.8
PVT-Tiny	13.2	1.9	24.9
ResNet50*	25.6	4.1	23.9
ResNet50	25.6	4.1	21.5
ResNeXt50-32x4d*	25.0	4.3	22.4
ResNeXt50-32x4d	25.0	4.3	20.5
T2T-ViT_t-14	22.0	6.1	19.3
TNT-S	23.8	5.2	18.7
DeiT-Small/16	22.1	4.6	20.1
PVT-Small	24.5	3.8	20.2
ResNet101*	44.7	7.9	22.6
ResNet101	44.7	7.9	20.2
ResNeXt101-32x4d*	44.2	8.0	21.2
ResNeXt101-32x4d	44.2	8.0	19.4
T2T-ViT_t-19	39.0	9.8	18.6
ViT-Small/16	48.8	9.9	19.2
PVT-Medium	44.2	6.7	18.8
ViT-Base/16	86.6	17.6	18.2
PVT-Large	61.4	9.8	18.3
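
To make the classification comparison easier to read, the following sketch computes the top-1 error gap between each PVT variant and its size-matched ResNet. The numbers are copied from the table above. Using the non-starred ResNet rows and pairing Tiny/Small/Medium with ResNet18/50/101 are choices made here for illustration, and `error_gaps` is a hypothetical helper.

```python
# Top-1 error (%) copied from the table (non-starred ResNet rows).
TOP1_ERROR = {
    "ResNet18": 31.5, "PVT-Tiny": 24.9,
    "ResNet50": 21.5, "PVT-Small": 20.2,
    "ResNet101": 20.2, "PVT-Medium": 18.8,
}

# Size-matched pairs (assumed pairing, by similar parameter count).
PAIRS = [
    ("ResNet18", "PVT-Tiny"),
    ("ResNet50", "PVT-Small"),
    ("ResNet101", "PVT-Medium"),
]


def error_gaps(errors=TOP1_ERROR, pairs=PAIRS):
    """Return the absolute top-1 error reduction (percentage points) per pair."""
    return {
        f"{cnn} -> {pvt}": round(errors[cnn] - errors[pvt], 1)
        for cnn, pvt in pairs
    }


if __name__ == "__main__":
    for name, gap in error_gaps().items():
        print(f"{name}: {gap} points lower top-1 error")
```

A positive gap means the PVT variant has the lower error in that pair.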

With RetinaNet on COCO object detection, PVT variants outperform CNN backbones of similar parameter count (e.g., PVT-Small at 40.4 AP versus ResNet50 at 36.3 AP).
PVT-Large reaches 42.6 AP on COCO with 30% fewer parameters than ResNeXt101-64x4d.
For instance segmentation, PVT-Tiny/Small/Medium beat the ResNet-18/50/101 baselines on COCO APm at comparable FLOPs.
For semantic segmentation on ADE20K, PVT backbones achieve higher mIoU than CNN backbones; PVT-Large reaches 42.1 mIoU, and 44.8 with multi-scale testing.
A pure Transformer DETR pipeline built on PVT (PVT+DETR) reaches 34.7 AP on COCO val2017, outperforming the ResNet50-based DETR.
On ImageNet classification, PVT models are competitive with ViT/DeiT and conventional CNNs, while their advantage is more pronounced on dense prediction tasks.

This review was generated by AI and reviewed by a human editor.