QUICK REVIEW

[Paper Review] Dilated Neighborhood Attention Transformer

Ali Hassani, Humphrey Shi|arXiv (Cornell University)|Sep 29, 2022

Advanced Neural Network Applications | Cited by 28

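One-sentence summary

DiNAT combines local Neighborhood Attention (NA) with Dilated Neighborhood Attention (DiNA), a dilated, sparse global extension of NA. This expands the receptive field exponentially at no additional computational cost and achieves state-of-the-art or near state-of-the-art results on several vision tasks. The implementation is released as open source.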

ABSTRACT

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).

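Motivation and objectives

Motivate the need to balance the efficiency of local attention against global context in hierarchical vision transformers.
Propose DiNA, which extends Neighborhood Attention with dilation to obtain sparse global attention.
Build DiNAT, a hierarchical vision transformer that uses NA and DiNA in alternating layers.
Evaluate DiNAT on image classification, object detection, and segmentation, and compare it with NAT, Swin, and ConvNeXt.
Provide an implementation and release that support research on vision transformers with sparse and global attention patterns.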

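Proposed method

Defines DiNA as a dilation-based extension of Neighborhood Attention (NA): for the i-th token, the j-th nearest neighbor is selected under a dilation constraint and denoted ρ_j^{δ}(i).
Formulates the DiNA attention weights A_i^{(k,δ)} from Q_i, the keys K_{ρ_j^{δ}(i)}, and the corresponding relative positional biases B(i, ρ_j^{δ}(i)).
Defines the DiNA output as DiNA_{k}^{δ}(i) = softmax(A_i^{(k,δ)} / sqrt(d)) V_i^{(k,δ)}, where d is the embedding dimension.
Introduces DiNAT as a hierarchical transformer that combines NA and DiNA layer by layer, with a dilation value δ for each layer and a downsampling scheme.
Analyzes receptive field growth: DiNA can raise the receptive field from NAT's linear growth to exponential growth, up to k^{ℓ} over ℓ layers.
Extends NATTEN (CUDA kernels) with dilation support to enable an efficient DiNA implementation.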
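
The DiNA definition above can be illustrated with a minimal single-head, 1D NumPy sketch. This is not the NATTEN implementation. It assumes that the dilation constraint means a token attends only to tokens whose index has the same remainder modulo δ, and that the window is shifted at the borders so every token keeps exactly k neighbors. The names `neighborhood_indices`, `dilated_na_1d`, and `rpb` are hypothetical and chosen only for illustration.

```python
import numpy as np

def neighborhood_indices(n, kernel_size):
    """For each of n positions, return its kernel_size nearest positions.
    The window is shifted (not padded) at the borders."""
    starts = np.clip(np.arange(n) - kernel_size // 2, 0, n - kernel_size)
    return starts[:, None] + np.arange(kernel_size)[None, :]  # (n, k)

def dilated_na_1d(q, k, v, kernel_size, dilation, rpb=None):
    """Single-head 1D dilated neighborhood attention (illustrative sketch).
    q, k, v: arrays of shape (n, d). rpb: optional bias of shape (2*kernel_size - 1,)."""
    n, d = q.shape
    out = np.empty_like(v)
    for r in range(dilation):
        idx = np.arange(r, n, dilation)      # tokens with index % dilation == r
        m = len(idx)
        if m < kernel_size:
            raise ValueError("each dilation group needs at least kernel_size tokens")
        local = neighborhood_indices(m, kernel_size)    # neighbors inside the group
        nbr = idx[local]                                # global neighbor indices (m, k)
        logits = np.einsum("md,mjd->mj", q[idx], k[nbr])  # Q_i . K_neighbor
        if rpb is not None:
            rel = local - np.arange(m)[:, None] + kernel_size - 1
            logits = logits + rpb[rel]                  # relative positional bias
        logits = logits / np.sqrt(d)
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(logits)
        w = w / w.sum(axis=1, keepdims=True)            # softmax over the k neighbors
        out[idx] = np.einsum("mj,mjd->md", w, v[nbr])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
out = dilated_na_1d(x, x, x, kernel_size=3, dilation=2)
print(out.shape)  # (12, 4)
```

With `dilation=1` this reduces to plain NA, and with `dilation=1` and `kernel_size` equal to the sequence length it reduces to full self attention. The cost per token depends only on `kernel_size`, which is why dilation adds no extra computation in this sketch.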

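Experimental results

Research questions

RQ1: Can dilated sparse global attention (DiNA) expand the receptive field beyond what local attention alone achieves while preserving locality?
RQ2: Does a hybrid architecture that combines NA and DiNA (DiNAT) outperform strong baselines (NAT, Swin, ConvNeXt) on classification, detection, and segmentation?
RQ3: How do the per-layer dilation values affect the receptive field, efficiency, and downstream task performance?
RQ4: How does DiNAT scale across data (ImageNet-1K/22K) and tasks (classification, detection, segmentation, panoptic segmentation)?
RQ5: What are the trade-offs between isotropic and hierarchical designs for NA/DiNA variants compared with full self attention?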

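Key findings

DiNA extends NA with sparse global attention and expands the receptive field exponentially at no additional computational cost.
By stacking NA and DiNA layers, DiNAT improves over the NAT, Swin, and ConvNeXt baselines on downstream vision tasks.
On ImageNet-1K at 224^2, DiNAT variants match or exceed NAT and are competitive with the ConvNeXt and Swin baselines in several configurations. With large-scale pre-training on ImageNet-22K, performance is competitive with Swin-Large and ConvNeXt-Large.
DiNAT-L reaches 86.6% top-1 accuracy on ImageNet-1K with 200M parameters and 30.6 GFLOPs, and shows competitive or superior performance on downstream segmentation tasks (panoptic, instance, semantic) without extra data.
Isotropic variants (DiNAT and NAT with a ViT-style isotropic design) reach accuracy competitive with the ViT and ConvNeXt baselines when FLOPs and parameters are matched.
The open-source release extends NA/NATTEN with dilation support and is intended to serve vision tasks broadly and, in the future, research beyond vision.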
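
The receptive field growth in the first finding can be checked with a short calculation. The sketch below assumes one spatial axis, interior tokens (no border effects), and the usual rule for stacked dilated local operators: each layer with kernel size k and dilation δ widens the receptive field by (k − 1)·δ. The function name `receptive_field_1d` and the dilation schedules are hypothetical and are not taken from the paper's configurations.

```python
def receptive_field_1d(kernel_size, dilations):
    """Receptive field along one axis after stacking one layer per entry
    in `dilations`, each with the same kernel size (interior tokens only)."""
    rf = 1
    for dilation in dilations:
        rf += (kernel_size - 1) * dilation  # each layer adds (k - 1) * dilation
    return rf

k, num_layers = 3, 4

# Plain NA in every layer: linear growth, num_layers * (k - 1) + 1
print(receptive_field_1d(k, [1] * num_layers))  # 9

# Dilation growing as k**layer: exponential growth, k ** num_layers
print(receptive_field_1d(k, [k ** layer for layer in range(num_layers)]))  # 81
```

Both stacks use the same kernel size in every layer, so the attention cost per token is the same. Only the choice of dilation values changes how far the receptive field reaches.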

This review was generated by AI and checked by a human editor.