QUICK REVIEW

[Paper Review] Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight

Qi Han, Zejia Fan | arXiv (Cornell University) | 2021. 06. 08.

Advanced Neural Network Applications | References: 75 | Citations: 24

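One-Line Summary

This paper reinterprets local attention, the main component of Local Vision Transformer, as a channel-wise locally-connected layer and analyzes it in terms of sparse connectivity, weight sharing, and dynamic weight computation. Models built on depth-wise convolution and its dynamic-weight variant perform on par with, or sometimes slightly better than, Swin Transformer on the ImageNet, COCO, and ADE benchmarks, which suggests that Local Vision Transformer benefits from these two forms of regularization and from dynamic weights.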

ABSTRACT

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs the attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization manners, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution and its dynamic version in sparse connectivity. The main difference lies in weight sharing - depth-wise convolution shares connection weights (kernel weights) across spatial positions. We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower computation complexity perform on-par with or sometimes slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection and ADE semantic segmentation. These observations suggest that Local Vision Transformer takes advantage of two regularization forms and dynamic weight to increase the network capacity.
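
The three properties named in the abstract can be written compactly. The block below is a sketch in notation chosen here, not necessarily the paper's: i and j index spatial positions, d indexes channels, \Omega_i is the local window around position i, D is the per-head channel dimension, and \kappa is a depth-wise kernel. The attention weight is the standard scaled dot-product form, with any position bias omitted for brevity.

```latex
% Channel-wise locally-connected layer (sparse connectivity):
% position i only sees positions j in its window, and channel d only sees channel d.
y_{i,d} = \sum_{j \in \Omega_i} w_{i,j,d}\, x_{j,d}

% Local attention: the weight is shared across the channels of one head
% and is predicted from the input (dynamic weight).
w_{i,j,d} = a_{ij}
          = \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{D}\right)}
                 {\sum_{j' \in \Omega_i} \exp\!\left(q_i^{\top} k_{j'} / \sqrt{D}\right)}

% Depth-wise convolution: the weight is shared across spatial positions
% (it depends only on the offset j - i) and is a static learned parameter.
w_{i,j,d} = \kappa_{j-i,\,d}
```

Read this way, the two layers have the same connectivity pattern and differ in which axis the weights are shared along and in whether the weights are computed from the input.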

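Research Motivation and Objectives

Understand the inductive biases and regularization mechanisms behind the success of Local Vision Transformer.
Analyze how sparse connectivity and weight sharing in local attention contribute to model capacity and generalization.
Analyze the role of dynamic weight computation in improving performance without increasing computational complexity.
Compare local attention with depth-wise convolution and assess whether the two reach comparable performance on vision tasks.
Empirically examine whether regularization and dynamic weights are the key factors behind Local Vision Transformer's strong performance.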

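Proposed Method

Reformulate local attention as a channel-wise locally-connected layer so that it can be analyzed from a network regularization perspective.
Analyze local attention through sparse connectivity: each position connects only to spatial neighbors within a small local window, and there are no connections across channels.
Characterize weight sharing: local attention shares the connection weights for a position across channels or within each group of channels, whereas depth-wise convolution shares its kernel weights across spatial positions.
Examine dynamic weights: the connection weights are predicted for each image instance, which enables adaptive feature modeling.
Build models based on depth-wise convolution and its dynamic-weight variant as counterparts for comparison with Swin Transformer.
Evaluate the models on ImageNet classification, COCO object detection, and ADE semantic segmentation to assess accuracy and efficiency.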
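
The contrast drawn in the list above can be made concrete with a small example. The following is a minimal pure-Python sketch, not the authors' implementation. It works on a 1D sequence instead of a 2D image, uses a single attention head with identity query/key/value projections, non-overlapping windows, and zero padding for the convolution. The function names `depthwise_conv1d` and `local_attention1d` are hypothetical and chosen only for illustration.

```python
import math

def depthwise_conv1d(x, kernel):
    """Depth-wise convolution over a sequence.

    x: list of N positions, each a list of C channel values.
    kernel: list of K taps (K odd), each a list of C per-channel weights.
    The kernel is a static parameter shared across all positions,
    and channel d of the output depends only on channel d of the input.
    """
    n, c, k = len(x), len(x[0]), len(kernel)
    radius = k // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for t in range(k):
            j = i + t - radius
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for d in range(c):
                    out[i][d] += kernel[t][d] * x[j][d]
    return out

def local_attention1d(x, window):
    """Windowed self-attention (assumption: identity Q/K/V, one head).

    The sequence is split into non-overlapping windows. Inside a window,
    the weight between positions i and j is a softmax over scaled dot
    products, so it is computed from the input itself (dynamic weight)
    and the same weight is applied to every channel (sharing across channels).
    """
    n, c = len(x), len(x[0])
    out = [[0.0] * c for _ in range(n)]
    for start in range(0, n, window):
        idx = list(range(start, min(start + window, n)))
        for i in idx:
            scores = [sum(x[i][d] * x[j][d] for d in range(c)) / math.sqrt(c)
                      for j in idx]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            for e, j in zip(exps, idx):
                for d in range(c):
                    out[i][d] += (e / total) * x[j][d]
    return out

# Toy input: 4 positions, 2 channels.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# Channel 0 is smoothed; channel 1 uses an identity kernel.
kernel = [[0.25, 0.0], [0.5, 1.0], [0.25, 0.0]]

conv_out = depthwise_conv1d(x, kernel)
attn_out = local_attention1d(x, window=2)
```

In this sketch both functions connect a position only to nearby positions and never combine values across channels. The difference is where the weights come from: `kernel` stays the same for every input and every position, while the attention weights are recomputed from `x` inside each window.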

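Experimental Results

Research Questions

RQ1: How do sparse connectivity and weight sharing in local attention contribute to the representational capacity of Local Vision Transformer?
RQ2: How much does dynamic weight computation contribute to the performance of local attention compared with fixed or shared weights?
RQ3: How do models based on depth-wise convolution and its dynamic-weight variant compare with Swin Transformer in accuracy and efficiency?
RQ4: What are the relative contributions of the regularization mechanisms (sparse connectivity and weight sharing) and dynamic weights?
RQ5: Can simpler architectures with inductive biases similar to local attention match or surpass Swin Transformer on vision benchmarks?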

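Key Results

Models based on depth-wise convolution and its dynamic-weight variant perform on par with Swin Transformer on ImageNet classification.
These models, which have lower computational complexity, match and sometimes slightly exceed Swin Transformer on COCO object detection and ADE semantic segmentation.
Sparse connectivity and weight sharing are identified as the two forms of regularization that local attention relies on for generalization and capacity.
Dynamic weights adapt the connection weights to each image instance, improving feature representation without added model complexity.
The empirical results suggest that the combination of the two regularization forms and dynamic weight computation is a key factor behind Local Vision Transformer's strong performance.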

This review was generated by AI and checked by a human editor.