QUICK REVIEW

[Paper Review] Glance-and-Gaze Vision Transformer

Qihang Yu, Yingda Xia | arXiv (Cornell University) | 2021-06-04

Visual Attention and Saliency Detection | References: 46 | Citations: 33

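One-Line Summary

GG-Transformer introduces parallel Glance and Gaze branches that give vision Transformers efficient long-range modeling together with local context, improving the accuracy-cost trade-off on ImageNet, ADE20K, and COCO. Adaptively-dilated self-attention (G-MSA) provides the global view, and a depth-wise Gaze branch compensates for the lost locality.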

ABSTRACT

Recently, there emerges a series of vision Transformers, which show superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers also come with a price: Self-attention, the core part of Transformer, has a quadratic complexity to the input sequence length. This leads to a dramatic increase of computation and memory cost with the increase of sequence length, thus introducing difficulties when applying Transformers to the vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes, with the ability to efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance and Gaze behavior is realized by two parallel branches: The Glance branch is achieved by performing self-attention on the adaptively-dilated partitions of the input, which leads to a linear complexity while still enjoying a global receptive field; The Gaze branch is implemented by a simple depth-wise convolutional layer, which compensates local image context to the features obtained by the Glance mechanism. We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. The codes and models will be made available at https://github.com/yucornetto/GG-Transformer.
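The quadratic-versus-linear claim in the abstract can be checked with a few lines of arithmetic. The sketch below compares the well-known cost of global self-attention, 4NC^2 + 2N^2C, with the partitioned-attention cost quoted in the method section of this review, 4NC^2 + 2M^2NC. The function names and the values C = 96 and M = 7 are illustrative assumptions, not settings taken from the paper.

```python
def msa_flops(n_tokens: int, channels: int) -> int:
    """Global self-attention cost, 4NC^2 + 2N^2C (quadratic in N)."""
    return 4 * n_tokens * channels**2 + 2 * n_tokens**2 * channels


def glance_flops(n_tokens: int, channels: int, partition: int) -> int:
    """Partitioned attention cost as quoted in this review, 4NC^2 + 2M^2NC (linear in N)."""
    return 4 * n_tokens * channels**2 + 2 * partition**2 * n_tokens * channels


if __name__ == "__main__":
    C, M = 96, 7  # hypothetical channel width and partition size
    for side in (56, 112, 224):
        n = side * side  # number of tokens in a side x side feature map
        ratio = msa_flops(n, C) / glance_flops(n, C, M)
        print(f"{side}x{side} tokens: MSA / G-MSA cost ratio = {ratio:.1f}")
```

Doubling the token count doubles the partitioned cost, while the global cost grows faster, which is why the gap widens on high-resolution feature maps used for dense prediction.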

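Motivation and Objectives

Enable efficient Transformer designs for high-resolution vision tasks that require dense prediction.
Propose the Glance-and-Gaze Transformer block, which combines long-range attention and local detail in parallel branches.
Show that GG-Transformer achieves a better accuracy-cost trade-off than previous Transformers on ImageNet, ADE20K, and COCO.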

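Proposed Method

Glance branch: self-attention over adaptively-dilated partitions, which keeps a global receptive field at linear complexity.
Gaze branch: a depth-wise convolution over the merged values that compensates for local context.
GG-MSA: performs merge-and-attend within partitions, keeping a global view while reducing computation (Ω(G-MSA) = 4NC^2 + 2M^2NC, for N tokens, C channels, and M×M tokens per partition).
Gaze branch options: a fixed or an adaptive kernel size for local feature compensation (adaptive is recommended).
Fully parallel GG-Transformer blocks are built on a four-stage hierarchical backbone similar to Swin-Transformer, which allows a fair comparison.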
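To make the Glance partitioning concrete, here is a minimal pure-Python sketch of how an adaptively-dilated partition could select tokens. It assumes that an H×W feature map is split into (H/M)·(W/M) partitions of M×M tokens each, sampled with dilation (H/M, W/M); this is a reading of the description above, and the function name `dilated_partitions` is hypothetical rather than taken from the authors' code.

```python
def dilated_partitions(height: int, width: int, m: int) -> list[list[tuple[int, int]]]:
    """Group the tokens of a height x width map into dilated partitions of m x m tokens.

    Assumption: the dilation rate adapts to the map size as (height // m, width // m),
    so every partition samples tokens spread across the whole feature map.
    """
    if height % m or width % m:
        raise ValueError("height and width must be divisible by m")
    dh, dw = height // m, width // m  # adaptive dilation rates
    partitions = []
    for a in range(dh):       # row offset of the partition
        for b in range(dw):   # column offset of the partition
            tokens = [(a + i * dh, b + j * dw) for i in range(m) for j in range(m)]
            partitions.append(tokens)
    return partitions


if __name__ == "__main__":
    parts = dilated_partitions(8, 8, 4)
    print(len(parts), "partitions of", len(parts[0]), "tokens")
    print("first partition rows:", sorted({r for r, _ in parts[0]}))
```

Self-attention would then run independently inside each partition, so each token attends to only M×M others while those others are drawn from across the full map. A Swin-style local window, by contrast, would group M×M adjacent tokens.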

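Experimental Results

Research Questions

RQ1: Does the GG block enable global long-range modeling without quadratic cost while preserving local detail?
RQ2: At the same model size, does it improve accuracy on ImageNet, ADE20K, and COCO over Swin-Transformer and other ViTs?
RQ3: How do the Glance and Gaze components each contribute to performance, and is combining them better than using either alone?
RQ4: Is GG-MSA a viable drop-in replacement in existing ViT architectures such as DeiT?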

Key Results

Model	Image size	Params (M)	FLOPs (G)	ImageNet Top-1 (%)	mIoU (%)	mIoU (ms+flip) (%)	AP^b (Mask R-CNN)	AP^m (Mask R-CNN)	AP^b (Cascade Mask R-CNN)
GG-T	224	28	4.5	82.0	-	-	-	-	-
GG-S	224	50	8.7	83.4	-	-	-	-	-
Swin-T	224	28	4.5	81.2	-	-	-	-	-
Swin-S	224	50	8.7	83.4	-	-	-	-	-
DeiT-S	224	22	4.6	79.8	-	-	-	-	-
DeiT-B	224	86	17.5	81.8	-	-	-	-	-

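At comparable FLOPs and parameter counts, GG-Transformer achieves higher ImageNet accuracy than other vision Transformers.
GG-T and GG-S match or exceed Swin-T and Swin-S at the same model size and compute cost; on ImageNet (224^2), GG-T reaches 82.0% and GG-S reaches 83.4%.
On ADE20K, GG-T reaches 46.4% mIoU (single-scale) and 47.2% with test-time augmentation, surpassing the ResNet50, PVT-Small, and Swin-T baselines; GG-S also exceeds Swin-S in mIoU (48.4% / 49.6%).
On COCO object detection, GG-T and GG-S backbones achieve higher AP than CNN and ViT backbones of similar size; GG-T reaches 44.1 AP^b and 39.9 AP^m in the Mask R-CNN / Cascade Mask R-CNN settings, outperforming Swin-T at comparable cost.
In the ablation analysis, Glance+Gaze with a convolution-based Gaze branch outperforms the MSA-only and Swin-style local-window variants; Glance+Gaze (Conv) reaches 80.28% top-1 on ImageNet in the Swin-T-based setting.
GG-MSA also improves DeiT backbones (GG-DeiT-T 73.8%, GG-DeiT-S 80.5%), showing that it generalizes beyond Swin-like architectures.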


This review was generated by AI and reviewed by a human editor.