QUICK REVIEW

[Paper Review] TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Shi Dai|arXiv (Cornell University)|Nov 28, 2023

Cell Image Analysis Techniques | Cited by 19

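One-Sentence Summary

TransNeXt introduces Aggregated Attention and Convolutional GLU into Vision Transformers, yielding biomimetic foveal perception that avoids depth degradation and achieving state-of-the-art accuracy and robustness across ImageNet classification, object detection, and semantic segmentation.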

ABSTRACT

Due to the depth degradation effect in residual connections, many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.

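Research Motivation and Objectives

Address the depth degradation that arises in efficient Vision Transformers that rely on stacking layers for information exchange.
Develop a biomimetic token mixer that gives every token global perception without deep stacking.
Introduce a channel mixer that strengthens local modeling and robustness.
Propose a unified backbone (TransNeXt) that performs well on classification, detection, and segmentation tasks.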

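Proposed Method

Introduce Pixel-focused Attention (PFA), which combines a fine-grained local attention path with a coarse global pooling path.
Integrate multiple attention modes into Aggregated Attention (AA), including QKV, LKV, and QLV mechanisms, with learnable tokens and positional information.
Use length-scaled cosine attention to improve extrapolation to multi-scale inputs.
Propose Convolutional GLU, a gated channel attention mechanism based on nearest-neighbor features, to improve robustness.
Build TransNeXt as a four-stage hierarchical backbone that incorporates AA and Convolutional GLU, keeping its design aligned with PVTv2.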
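
To make the length-scaled cosine attention item above more concrete, here is a minimal sketch in plain Python. It assumes the common formulation in which queries and keys are L2-normalized and the resulting cosine similarity is multiplied by tau * log(N), where N is the number of keys a query attends to and tau is a temperature. The function names, the fixed `tau`, and the list-of-lists layout are illustrative assumptions, not the authors' implementation; this review does not state the exact formula, so consult the paper for the precise definition.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def length_scaled_cosine_attention(queries, keys, values, tau=1.0):
    """Sketch: softmax(tau * log(N) * cos(q, k)) applied to the values.

    N is the number of keys. Scaling by log(N) is assumed here as the way
    the attention sharpness adapts when the number of keys changes, for
    example at a different input resolution. With a single key, log(1) = 0
    and the lone value is returned unchanged.
    """
    scale = tau * math.log(len(keys))
    keys_hat = [l2_normalize(k) for k in keys]
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        q_hat = l2_normalize(q)
        # Cosine similarity between the query and every key, then scaled.
        scores = [scale * sum(a * b for a, b in zip(q_hat, k)) for k in keys_hat]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
print(length_scaled_cosine_attention([[2.0, 0.0]], keys, values))
```

Because both queries and keys are normalized, rescaling a query (for example `[2.0, 0.0]` versus `[200.0, 0.0]`) leaves the output unchanged; only the direction of the vectors and the key count influence the attention weights in this sketch.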

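Experimental Results

Research Questions

RQ1: Can an aggregated, biomimetic attention mechanism overcome depth degradation and improve information mixing in ViTs without deep stacking?
RQ2: Does integrating learnable query tokens and diverse position biases improve affinity matrix generation beyond query-key similarity?
RQ3: Does a convolution-based channel mixer (Convolutional GLU) improve local feature modeling and model robustness in ViTs?
RQ4: How does TransNeXt perform across model sizes on standard and robustness-oriented vision tasks (ImageNet, ImageNet-A, COCO, ADE20K)?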

Key Findings

TransNeXt-Tiny achieves 84.0% ImageNet-1K top-1 accuracy at 224^2 with 28.2M parameters and 5.7G FLOPs, surpassing ConvNeXt-B with 69% fewer parameters.
TransNeXt-Base achieves 86.2% ImageNet-1K top-1 accuracy and 61.6% ImageNet-A top-1 accuracy at 384^2, 57.1 mAP on COCO object detection, and 54.7 mIoU on ADE20K semantic segmentation.
TransNeXt-Small achieves 84.7% ImageNet-1K top-1 accuracy and 58.3% ImageNet-A at 384^2; TransNeXt-Small/Base reach 61.6% and 57.7% on IN-A and IN-R respectively, illustrating robustness gains.
On ImageNet-A at 224^2, TransNeXt-Base outperforms MaxViT-Base by 6.4% top-1.
TransNeXt-Tiny/Small/Base show robustness advantages over ConvNeXt-L and comparable or superior performance to larger ViT-based backbones across multiple tasks.

This review was generated by AI and checked by a human editor.