QUICK REVIEW

[Paper Review] Glance-and-Gaze Vision Transformer

Qihang Yu, Yingda Xia|arXiv (Cornell University)|Jun 4, 2021

Visual Attention and Saliency Detection|References: 46|Citations: 33

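One-Line Summary

GG-Transformer introduces parallel Glance and Gaze branches that efficiently combine long-range modeling with local context, improving the accuracy-cost trade-off on ImageNet, ADE20K, and COCO. It pairs adaptively-dilated self-attention (G-MSA) with a depth-wise convolutional Gaze branch that compensates for the locality the attention lacks.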

ABSTRACT

Recently, there emerges a series of vision Transformers, which show superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers also come with a price: Self-attention, the core part of Transformer, has a quadratic complexity to the input sequence length. This leads to a dramatic increase of computation and memory cost with the increase of sequence length, thus introducing difficulties when applying Transformers to the vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes, with the ability to efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance and Gaze behavior is realized by two parallel branches: The Glance branch is achieved by performing self-attention on the adaptively-dilated partitions of the input, which leads to a linear complexity while still enjoying a global receptive field; The Gaze branch is implemented by a simple depth-wise convolutional layer, which compensates local image context to the features obtained by the Glance mechanism. We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. The codes and models will be made available at https://github.com/yucornetto/GG-Transformer.
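The adaptively-dilated partitioning mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `dilated_partitions` and the exact indexing rule are assumptions made for illustration. It assumes each partition holds M x M tokens sampled from the whole h x w grid with strides (h/M, w/M), so every partition spans the full feature map while attention inside it stays cheap.

```python
def dilated_partitions(h, w, m):
    """Group an h x w token grid into partitions of m x m tokens each.

    Assumption (illustrative, not taken from the paper's code): tokens in a
    partition are sampled with strides (h // m, w // m), so the dilation
    rate adapts to the feature-map size.
    """
    if h % m or w % m:
        raise ValueError("this sketch requires h and w to be divisible by m")
    dh, dw = h // m, w // m  # dilation rate along each axis
    partitions = []
    for a in range(dh):          # row offset of the partition
        for b in range(dw):      # column offset of the partition
            partitions.append(
                [(a + i * dh, b + j * dw) for i in range(m) for j in range(m)]
            )
    return partitions


parts = dilated_partitions(4, 4, 2)
print(len(parts))   # 4 partitions
print(parts[0])     # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

In this toy 4 x 4 case the first partition touches rows 0 and 2 and columns 0 and 2, which is what gives each partition a view of the whole input rather than a local window.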

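Motivation and Objectives

Motivate an efficient Transformer design for high-resolution vision tasks that require dense prediction.
Propose a Glance-and-Gaze Transformer block that combines long-range attention and local detail in parallel branches.
Show that GG-Transformer achieves a better accuracy-cost trade-off than previous Transformers on ImageNet, ADE20K, and COCO.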

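Proposed Method

Glance branch: performs self-attention on adaptively-dilated partitions, keeping a global receptive field at linear complexity.
Gaze branch: a depth-wise convolution, applied to the merged values, that compensates for the missing local context.
GG-MSA: combines merging with attention inside partitions to keep a global view at reduced computational cost (Ω(G-MSA) = 4NC^2 + 2M^2NC).
Gaze branch options: a fixed or an adaptive kernel size for compensating local features (adaptive is recommended).
The fully parallel GG-Transformer block is built into a hierarchical backbone with a four-stage structure similar to Swin-Transformer, for a fair comparison.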
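The cost formula above can be checked numerically. The sketch below evaluates it next to the standard cost of full multi-head self-attention, 4NC^2 + 2N^2C. The readings of the symbols are assumptions, since the review does not define them: N is taken as the number of tokens, C as the channel dimension, and M as the side length of a partition (M^2 tokens per partition). The sizes used (a 56 x 56 token grid, C = 96, M = 7) are illustrative values, not figures from the review.

```python
def msa_cost(n, c):
    """Cost of full self-attention over n tokens with c channels."""
    return 4 * n * c**2 + 2 * n**2 * c


def g_msa_cost(n, c, m):
    """Cost from the formula above: 4NC^2 + 2M^2NC (linear in n for fixed m)."""
    return 4 * n * c**2 + 2 * m**2 * n * c


# Illustrative sizes (assumed, not taken from the review).
n, c, m = 56 * 56, 96, 7

full = msa_cost(n, c)
glance = g_msa_cost(n, c, m)
print(f"full MSA : {full:,}")
print(f"G-MSA    : {glance:,}")
print(f"ratio    : {full / glance:.1f}x")

# Doubling the token count doubles the G-MSA cost, but more than doubles MSA.
print(g_msa_cost(2 * n, c, m) / glance)
print(msa_cost(2 * n, c) / full)
```

The second term is where the two differ: it grows with N^2 for full attention but only with N for the partitioned variant, which is why the latter remains practical on high-resolution feature maps.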

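Experimental Results

Research Questions

RQ1: Can the Glance-and-Gaze (GG) block provide global long-range modeling without quadratic cost while preserving local detail?
RQ2: Does the GG-Transformer block improve accuracy on ImageNet, ADE20K, and COCO over Swin-Transformer and other ViTs at comparable model size?
RQ3: How do the Glance and Gaze components each contribute to performance, and is their combination better than either alone?
RQ4: Is GG-MSA viable as a drop-in replacement in existing ViT architectures such as DeiT?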

Key Findings

Model	Image size	Params (M)	FLOPs (G)	ImageNet Top-1 (%)	mIoU (%)	mIoU (ms+flip) (%)	AP^b (Mask R-CNN)	AP^m (Mask R-CNN)	AP^b (Cascade Mask R-CNN)
GG-T	224	28	4.5	82.0	-	-	-	-	-
GG-S	224	50	8.7	83.4	-	-	-	-	-
Swin-T	224	28	4.5	81.2	-	-	-	-	-
Swin-S	224	50	8.7	83.4	-	-	-	-	-
DeiT-T	224	22	4.6	81.0	-	-	-	-	-
DeiT-S	224	86	17.5	81.8	-	-	-	-	-

On ImageNet, GG-Transformer achieves higher accuracy than comparable Vision Transformers with similar FLOPs and parameter counts.
At the same model size and compute cost, GG-T and GG-S match or exceed the top-1 accuracy of Swin-T/S, with GG-T at 82.0% and GG-S at 83.4% on ImageNet (224^2).
On ADE20K, GG-T achieves 46.4% mIoU at single scale and 47.2% with test-time augmentation, outperforming the ResNet50, PVT-Small, and Swin-T baselines. GG-S outperforms Swin-S in mIoU (48.4%/49.6%).
On COCO object detection, GG-T and GG-S backbones achieve higher AP than CNNs and ViTs of comparable size; GG-T reaches 44.1 AP^b and 39.9 AP^m in the Mask R-CNN/Cascade Mask R-CNN settings, outperforming Swin-T at comparable cost.
In the ablation, Glance+Gaze with a convolution-based Gaze branch outperforms both MSA alone and the Swin-style local-window approach; Glance+Gaze (Conv) reaches 80.28% ImageNet top-1 under the Swin-T baseline setting.
GG-MSA can also improve DeiT backbones (GG-DeiT-T 73.8%, GG-DeiT-S 80.5%), showing versatility beyond Swin-like architectures.


This review was generated by AI and checked by a human editor.