QUICK REVIEW

[Paper Review] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Yufei Xu, Qiming Zhang | arXiv (Cornell University) | 2021-06-07

Advanced Neural Network Applications | References: 83 | Citations: 155

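One-Line Summary

ViTAE brings the intrinsic inductive biases of convolutions into vision transformers through parallel locality modeling and multi-scale reduction cells, achieving data-efficient and training-efficient ImageNet performance.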

ABSTRACT

Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Alternatively, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and is able to learn robust feature representation for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. Experiments on ImageNet as well as downstream tasks prove the superiority of ViTAE over the baseline transformer and concurrent works. Source code and pretrained models will be available at GitHub.

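Motivation and Goals

Motivates integrating intrinsic inductive biases into vision transformers to improve the learning of local and scale-aware features.
Designs Reduction Cells and Normal Cells, which jointly model multi-scale context and process locality in parallel, and combines them with self-attention.
Demonstrates improvements in data and training efficiency, classification accuracy, and downstream generalization.
Provides ablation studies showing the contribution of the convolution-based modules and the fusion strategy.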

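Proposed Method

Introduces two cell types: Reduction Cells (RC), which consist of a Pyramid Reduction Module that captures multi-scale context using several dilation rates together with downsampling, and Normal Cells (NC), which fuse MHSA with a Parallel Convolutional Module (PCM).
The three RCs downsample their input by 4x, 2x, and 2x, giving token maps of H/4 x W/4, H/8 x W/8, and H/16 x W/16 respectively. The output of the last RC is flattened and concatenated with a class token before entering the NCs.
The Pyramid Reduction Module in an RC generates multi-scale features with convolutions at different dilation rates; the MHSA branch processes this multi-scale context while the PCM branch injects local features, after which the two are fused and passed to the FFN.
An NC keeps the token length unchanged, applies MHSA in parallel with the PCM, fuses the two by addition, and passes the result through the FFN, with layer normalization and skip connections.
The model stacks several NCs after three RCs; the ViTAE-T and ViTAE-S configurations are trained on ImageNet with standard augmentation for a fair comparison.
Training and evaluation use AdamW, a cosine scheduler, 300 epochs, and eight V100 GPUs, with comparisons against CNNs and transformers of similar size.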
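
The downsampling path described above can be checked with a little shape arithmetic. The sketch below is only illustrative: the strides (4, 2, 2) and the class token come from this review, while the kernel sizes, dilation rates, the 224 x 224 input, the "same-style" padding rule, and the names `conv_out`, `reduction_chain`, and `STAGES` are assumptions made for the example, not the authors' configuration. It uses the standard convolution output-size formula to show that dilated branches with matching padding produce equally sized maps, which is what lets a pyramid module combine them.

```python
def conv_out(size, kernel, stride, dilation=1, padding=None):
    """Output length of a convolution along one spatial axis.

    When padding is None, use dilation * (kernel - 1) // 2. This is an
    assumed "same-style" padding, chosen so that every dilation rate
    yields the same output size for an odd kernel.
    """
    span = dilation * (kernel - 1)  # effective kernel size minus one
    if padding is None:
        padding = span // 2
    return (size + 2 * padding - span - 1) // stride + 1


# Hypothetical stage settings. Only the strides (4, 2, 2) are taken from the
# review; kernel sizes and dilation rates are illustrative assumptions.
STAGES = [
    {"kernel": 7, "stride": 4, "dilations": [1, 2, 3, 4]},
    {"kernel": 3, "stride": 2, "dilations": [1, 2, 3]},
    {"kernel": 3, "stride": 2, "dilations": [1, 2]},
]


def reduction_chain(height, width, stages):
    """Return the (height, width) of the token map after each reduction stage."""
    shapes = []
    for stage in stages:
        branch_shapes = {
            (
                conv_out(height, stage["kernel"], stage["stride"], d),
                conv_out(width, stage["kernel"], stage["stride"], d),
            )
            for d in stage["dilations"]
        }
        # All dilated branches must agree in size before they can be combined.
        assert len(branch_shapes) == 1, "dilated branches disagree in size"
        height, width = branch_shapes.pop()
        shapes.append((height, width))
    return shapes


shapes = reduction_chain(224, 224, STAGES)
final_h, final_w = shapes[-1]
# Flatten the last map into tokens and add one class token.
token_count = final_h * final_w + 1

print(shapes)       # [(56, 56), (28, 28), (14, 14)]
print(token_count)  # 197
```

For a 224 x 224 input this gives maps of H/4, H/8, and H/16 per side, and the Normal Cells then operate on a fixed-length sequence of 196 patch tokens plus the class token.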

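Experimental Results

Research Questions

RQ1: Can the intrinsic inductive biases of CNNs (locality and scale invariance) be integrated effectively into vision transformers to improve data efficiency and multi-scale feature learning?
RQ2: Does fusing local and global modeling in parallel within each layer outperform a serial locality-then-attention structure in vision transformers?
RQ3: How do RC and NC contribute, individually and together, to accuracy, training efficiency, and downstream generalization?
RQ4: How do the data and training efficiency of ViTAE compare with baselines such as T2T-ViT and DeiT on ImageNet and smaller datasets?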

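Key Results

ViTAE-T reaches 75.3% Top-1 accuracy on ImageNet with 4.8M parameters, and ViTAE-S reaches 82.0% Top-1 with 23.6M parameters.
ViTAE is more data-efficient and training-efficient, outperforming the T2T-ViT baseline when training data and epochs are reduced.
Ablation studies show that the PCM (locality) and the RCs (multi-scale) both improve performance substantially, and that pre-fusion combined with BN gives the best results.
On downstream tasks including CIFAR-10/100, iNaturalist, Cars, Flowers, and Pets, ViTAE generalizes well with a similar or smaller number of parameters.
Visual analysis suggests that ViTAE focuses attention on the target object more accurately and handles scale variation better than a pure transformer.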

This review was generated by AI and reviewed by a human editor.