QUICK REVIEW

[Paper Review] TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Shi Dai | arXiv (Cornell University) | November 28, 2023

Cell Image Analysis Techniques | Citations: 19

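One-Line Summary

TransNeXt introduces Aggregated Attention and Convolutional GLU into vision transformers to implement biomimetic foveal perception, avoids depth degradation, and achieves state-of-the-art accuracy and robustness across ImageNet classification, object detection, and semantic segmentation.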

ABSTRACT

Due to the depth degradation effect in residual connections, many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.

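Motivation and Objectives

Address the depth degradation that arises in efficient vision transformers that rely on layer stacking for information exchange.
Develop a biomimetic token mixer that gives every token global perception without deep stacking.
Introduce a channel mixer that improves local modeling and robustness.
Propose a unified backbone (TransNeXt) that performs strongly on classification, detection, and segmentation tasks.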

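Proposed Method

Introduce Pixel-focused Attention (PFA), which combines a fine-grained local attention path with a coarse-grained global pooling path.
Aggregate several attention variants into Aggregated Attention (AA), including QKV, LKV, and QLV mechanisms that incorporate learnable tokens and positional information.
Use length-scaled cosine attention to improve extrapolation to multi-scale inputs.
Propose Convolutional GLU, a gated channel-attention mechanism based on nearest-neighbor features, to strengthen robustness.
Build TransNeXt as a four-stage hierarchical backbone that follows the PVTv2 design and contains AA and Convolutional GLU.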
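The first and third items above can be illustrated with a small toy. The sketch below is not the paper's implementation: it is a 1D, single-head simplification in which each query attends to a local window of fine-grained tokens plus average-pooled tokens covering the whole sequence, and the logits are cosine similarities multiplied by tau * log(N), where N is the number of keys the query sees. The scale form, the function names, and the parameters `radius`, `pool`, and `tau` are assumptions made for illustration. Learned projections, positional biases, and the learnable tokens of Aggregated Attention are omitted.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def avg_pool(tokens, pool):
    """Average non-overlapping groups of `pool` tokens (the coarse global path)."""
    pooled = []
    for start in range(0, len(tokens), pool):
        group = tokens[start:start + pool]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def pixel_focused_attention(tokens, radius=1, pool=2, tau=1.0):
    """Toy 1D attention: fine local keys plus pooled global keys (values = keys)."""
    pooled = avg_pool(tokens, pool)
    out = []
    for i, q in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        keys = tokens[lo:hi] + pooled          # fine-grained nearby, coarse globally
        scale = tau * math.log(len(keys))      # assumed length-dependent scale
        q_hat = l2_normalize(q)
        logits = [scale * sum(a * b for a, b in zip(q_hat, l2_normalize(k)))
                  for k in keys]
        peak = max(logits)                     # subtract the max for a stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) / total
                    for d in range(len(q))])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
mixed = pixel_focused_attention(tokens)
```

Because the pooled keys span the full sequence, every output token depends on every input token after a single layer, which is the property the review attributes to avoiding stacking for information exchange.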

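Experimental Results

Research Questions

RQ1: Can aggregated biomimetic attention overcome depth degradation and improve information mixing in ViTs without deep stacking?
RQ2: Can integrating learnable query tokens and diverse positional biases improve affinity-matrix generation beyond QKV similarity?
RQ3: Can a convolution-based channel mixer (Convolutional GLU) improve local feature modeling and model robustness in ViTs?
RQ4: How does TransNeXt perform across model scales on standard and robustness-oriented vision tasks (ImageNet, ImageNet-A, COCO, ADE20K)?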
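To make RQ3 more concrete, the following is a minimal sketch of a gated channel mixer in the spirit of Convolutional GLU. It is an assumed arrangement, not the paper's code: the gate branch is passed through a zero-padded 3x3 depthwise convolution and a GELU, so each token's gate depends on its nearest neighbors, and the result multiplies a per-token value branch. All function names and weight arguments are hypothetical, and bias terms are omitted.

```python
import math

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight):
    """Apply a weight matrix given as a list of rows (one row per output channel)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def depthwise_conv3x3(fmap, kernels):
    """fmap[y][x][c] with one 3x3 kernel per channel; borders are zero-padded."""
    h, w, ch = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * ch for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for c in range(ch):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[c][dy + 1][dx + 1] * fmap[yy][xx][c]
                out[y][x][c] = acc
    return out

def conv_glu(fmap, w_gate, w_value, kernels, w_out):
    """Gate from neighboring features, value from the token itself, then project."""
    gate_in = [[linear(tok, w_gate) for tok in row] for row in fmap]
    value = [[linear(tok, w_value) for tok in row] for row in fmap]
    gate = depthwise_conv3x3(gate_in, kernels)
    h, w = len(fmap), len(fmap[0])
    return [[linear([gelu(g) * v for g, v in zip(gate[y][x], value[y][x])], w_out)
             for x in range(w)] for y in range(h)]
```

With a kernel that is nonzero only at its center, the gate sees only the token itself and the block reduces to an ordinary GLU-style gate; nonzero off-center weights are what let neighboring features influence each token's channel gating.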

Key Findings

TransNeXt-Tiny reaches 84.0% ImageNet-1K top-1 accuracy at 224^2 with 28.2M parameters and 5.7G FLOPs, using 69% fewer parameters than ConvNeXt-B.
TransNeXt-Base achieves 86.2% ImageNet-1K top-1 accuracy and 61.6% on ImageNet-A at 384^2, along with 57.1 mAP on COCO object detection and 54.7 mIoU on ADE20K semantic segmentation.
TransNeXt-Small reaches 84.7% ImageNet-1K top-1 accuracy and 58.3% on ImageNet-A at 384^2; robustness scores rise further to 61.6% on ImageNet-A for TransNeXt-Base, with 57.7% reported on ImageNet-R.
At 224^2, TransNeXt-Base outperforms MaxViT-Base by 6.4 percentage points in top-1 accuracy on ImageNet-A.
TransNeXt-Tiny/Small/Base hold a robustness advantage over ConvNeXt-L and match or exceed larger ViT-based backbones on several tasks.

This review was generated by AI and reviewed by a human editor.