QUICK REVIEW

[Paper Review] On the Connection between Local Attention and Dynamic Depth-wise Convolution

Qi Han, Zejia Fan|arXiv (Cornell University)|2021. 06. 08.

CCD and CMOS Imaging Sensors | Citations: 70

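One-Line Summary

This paper interprets local attention as a channel-wise locally-connected layer with dynamic weights and, through ablation experiments and comparisons across several vision tasks, shows how it connects to depth-wise convolution and its dynamic variants.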

ABSTRACT

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs the attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization manners, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution and its dynamic version in sparse connectivity. The main difference lies in weight sharing - depth-wise convolution shares connection weights (kernel weights) across spatial positions. We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower computation complexity perform on-par with or sometimes slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection and ADE semantic segmentation. These observations suggest that Local Vision Transformer takes advantage of two regularization forms and dynamic weight to increase the network capacity. Code is available at https://github.com/Atten4Vis/DemystifyLocalViT.
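The three properties named in the abstract can be written out side by side. The following is a sketch in notation chosen for this review (the symbols $\mathcal{N}(i)$, $a_{ij}$, $\mathbf{w}_{ij}$, and $\mathbf{1}_D$ are assumptions and may differ from the paper's own): $\mathbf{x}_j \in \mathbb{R}^D$ is the (value) feature at position $j$, $\mathcal{N}(i)$ is the local window of position $i$, and $\odot$ is the element-wise product over channels.

```latex
% Local attention over a window, single head:
\mathbf{y}_i = \sum_{j \in \mathcal{N}(i)} a_{ij}\,\mathbf{x}_j,
\qquad
a_{ij} = \frac{\exp\!\left(\mathbf{q}_i^{\top}\mathbf{k}_j / \sqrt{D}\right)}
              {\sum_{l \in \mathcal{N}(i)} \exp\!\left(\mathbf{q}_i^{\top}\mathbf{k}_l / \sqrt{D}\right)}

% The same layer as a channel-wise locally-connected layer:
% one weight per (i, j) pair, repeated for every channel, predicted per input.
\mathbf{y}_i = \sum_{j \in \mathcal{N}(i)} \mathbf{w}_{ij} \odot \mathbf{x}_j,
\qquad
\mathbf{w}_{ij} = a_{ij}\,\mathbf{1}_D

% Depth-wise convolution: the weight depends only on the offset j - i,
% so it is shared across positions, differs per channel, and is static.
\mathbf{y}_i = \sum_{j \in \mathcal{N}(i)} \mathbf{w}_{j-i} \odot \mathbf{x}_j,
\qquad
\mathbf{w}_{j-i} \in \mathbb{R}^{D}
```

Read this way, both layers have the same sparse connectivity (no mixing across channels, only positions inside the window), and they differ in where weights are shared and in whether the weights are computed from the input.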

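Motivation and Objectives

Deepen the understanding of local attention from the perspective of network regularization (sparse connectivity and weight sharing) and dynamic weight computation.
Reformulate local attention as a channel-wise locally-connected layer with dynamic weights.
Examine the connection between local attention and (dynamic) depth-wise convolution, both theoretically and empirically.
Compare DWNet (a depth-wise convolution based network) against the local attention based Swin Transformer on ImageNet, COCO, and ADE20K.
Provide practical insights into weight sharing and dynamic weight mechanisms for improving efficiency and performance.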

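Proposed Method

Reformulate local attention as a channel-wise, spatially locally-connected layer with dynamic weights.
Compare local attention and depth-wise convolution in terms of sparse connectivity, weight sharing patterns, and dynamic weight computation.
Propose DWNet by replacing the local attention in Swin Transformer with depth-wise convolution under the same architecture and window settings.
Introduce homogeneous and inhomogeneous dynamic variants of depth-wise convolution, with weights predicted from globally pooled features or from the center position.
Conduct ablation studies to assess the effect of weight sharing, dynamic weights, and window sampling strategies.
Benchmark on ImageNet, COCO, and ADE20K under a Swin Transformer-like training protocol.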
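The reformulation listed above can be sketched in NumPy under simplifying assumptions: a 1D sequence instead of a 2D feature map, a single attention head, non-overlapping windows, no value projection, and floating-point inputs. The function names `depthwise_conv1d` and `local_attention1d` and the projection matrices `wq` and `wk` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Depth-wise convolution on x of shape (N, C) with kernel of shape (K, C), K odd.

    The same kernel is applied at every position (shared across positions),
    each channel has its own K weights, and channels are never mixed.
    """
    n, _ = x.shape
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))      # zero-pad along the position axis
    out = np.empty_like(x)
    for i in range(n):
        out[i] = (kernel * xp[i:i + k]).sum(axis=0)
    return out

def local_attention1d(x, wq, wk, window):
    """Window attention on x of shape (N, C); wq and wk have shape (C, D).

    Assumption for brevity: values are x itself (no value projection).
    Returns the output and the per-window attention weights.
    """
    n, _ = x.shape
    assert n % window == 0, "sequence length must be a multiple of the window"
    out = np.empty_like(x)
    attn = []
    for s in range(0, n, window):
        xs = x[s:s + window]                  # (W, C) features of one window
        q, k = xs @ wq, xs @ wk
        logits = q @ k.T / np.sqrt(q.shape[1])
        logits = logits - logits.max(axis=1, keepdims=True)
        a = np.exp(logits)
        a = a / a.sum(axis=1, keepdims=True)  # (W, W), computed from the input
        # One scalar weight per (i, j) pair, broadcast over all channels:
        # this is the "shared across channels" weight pattern.
        out[s:s + window] = (a[:, :, None] * xs[None, :, :]).sum(axis=1)
        attn.append(a)
    return out, np.stack(attn)
```

Both functions compute a per-channel weighted sum over a small neighborhood. The convolution reuses one static kernel at every position, while the attention layer recomputes a weight for each pair of positions from the input and repeats it across channels.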

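Experimental Results

Research Questions

RQ1: How does local attention relate to depth-wise convolution in terms of connectivity, weight sharing, and dynamic weight computation?
RQ2: Do the dynamic depth-wise convolution variants (DWNet) achieve performance competitive with local attention on ImageNet classification, COCO object detection, and ADE semantic segmentation?
RQ3: What role do weight sharing and dynamic weight prediction play in the effectiveness of local attention versus depth-wise convolution?
RQ4: Under comparable training settings, can a depth-wise convolution based architecture (DWNet) match or exceed Swin Transformer?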

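Key Findings

Local attention is a channel-wise locally-connected layer with dynamic weights: it shares weights across channels and predicts them separately for each instance.
Depth-wise convolution shares weights across spatial positions and benefits from weight sharing across channels and/or positions; its dynamic variants predict weights with linear projections or center-position-based prediction.
DWNet and its dynamic variants perform on par with or slightly better than Swin Transformer on ImageNet classification, COCO object detection, and ADE semantic segmentation, with lower computation cost in several settings.
Sharing weights across channels helps reduce the parameter count of local attention, while sharing across positions reduces the parameters of depth-wise convolution and enables translation-equivariant representations.
Dynamic weight mechanisms improve both local attention and depth-wise convolution; in certain settings, dynamic weights based on linear projection tend to be preferable to attention-based schemes.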
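To make the idea of linear-projection-based dynamic weights concrete, here is a minimal sketch of a homogeneous dynamic depth-wise convolution on a 1D sequence. It assumes the kernel is predicted from the globally average-pooled feature by a single linear projection; the name `dynamic_depthwise_conv1d`, the matrix `proj`, and the single-layer predictor are simplifications made for this illustration, not the paper's implementation.

```python
import numpy as np

def dynamic_depthwise_conv1d(x, proj, kernel_size):
    """Homogeneous dynamic depth-wise convolution (illustrative sketch).

    x:    (N, C) float features of one input instance
    proj: (C, kernel_size * C) hypothetical linear projection that predicts the kernel
    kernel_size: odd window size K

    The kernel is predicted once per instance from the pooled feature and then
    applied at every position, so it is dynamic but still shared across positions.
    """
    n, c = x.shape
    pooled = x.mean(axis=0)                            # (C,) global average pooling
    kernel = (pooled @ proj).reshape(kernel_size, c)   # (K, C) instance-specific kernel
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))               # zero-pad along the position axis
    out = np.empty_like(x)
    for i in range(n):
        out[i] = (kernel * xp[i:i + kernel_size]).sum(axis=0)
    return out, kernel
```

Two different inputs passed through the same `proj` yield two different kernels, which is the sense in which the weights are dynamic. An inhomogeneous variant would instead predict a separate kernel at each position, giving up the sharing across positions.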


This review was generated by AI and reviewed by a human editor.