QUICK REVIEW

[Paper Review] BiFormer: Vision Transformer with Bi-Level Routing Attention

Lei Zhu, Xinjiang Wang|arXiv (Cornell University)|Mar 15, 2023

Advanced Image and Video Retrieval Techniques|Cited by 64

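One-Line Summary

BiFormer introduces bi-level routing attention (BRA), a dynamic, query-aware sparse attention mechanism that filters for relevance at the region level before running token-level attention, improving both accuracy and efficiency across vision tasks. On ImageNet-1K it reaches 81.4% to 84.3% Top-1 at 2.2 to 9.8 GFLOPs (BiFormer-T to BiFormer-B), and it transfers well to detection and segmentation.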

ABSTRACT

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.

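Motivation and Objectives

Address the need for efficient long-range modeling in vision transformers without sacrificing accuracy.
Propose a dynamic, query-aware sparsity mechanism (BRA) that selectively attends to relevant regions and tokens.
Develop BiFormer, a BRA-based backbone for image classification, detection, and segmentation.
Provide a complexity analysis showing that BRA achieves lower complexity than vanilla attention under a suitable region partition.
Demonstrate state-of-the-art or competitive performance on ImageNet-1K, COCO, and ADE20K at comparable FLOPs.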

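Proposed Method

Introduce bi-level routing attention (BRA), which first builds a region-level affinity graph and keeps only the top-k connections for each region.
Compute region-level queries and keys by averaging the token-level queries and keys within each region.
Gather the keys and values of the routed regions, then run token-to-token attention over this sparse but gathered set.
Add a local context enhancement term implemented with a depthwise convolution.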
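The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single head, inputs that are already projected to queries, keys, and values, a feature map whose height and width are divisible by the region grid size S, and it omits the depthwise-convolution local context term. The names `to_regions` and `bi_level_routing_attention` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def to_regions(x, S):
    """Split an (H, W, C) map into an S x S grid: (S*S, tokens_per_region, C)."""
    H, W, C = x.shape
    h, w = H // S, W // S
    x = x.reshape(S, h, S, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(S * S, h * w, C)

def bi_level_routing_attention(q, k, v, S, topk):
    """Single-head BRA sketch (assumption: q, k, v are projected (H, W, C) maps)."""
    H, W, C = q.shape
    assert H % S == 0 and W % S == 0, "H and W must be divisible by S"
    qr, kr, vr = to_regions(q, S), to_regions(k, S), to_regions(v, S)

    # Step 1: region-level queries/keys are per-region means of the tokens.
    q_reg, k_reg = qr.mean(axis=1), kr.mean(axis=1)           # (S*S, C)

    # Step 2: region-to-region affinity graph, pruned to top-k per region.
    affinity = q_reg @ k_reg.T                                # (S*S, S*S)
    routed = np.argsort(-affinity, axis=-1)[:, :topk]         # (S*S, topk)

    # Step 3: gather keys/values of the routed regions for each query region.
    k_g = kr[routed].reshape(S * S, -1, C)                    # (S*S, topk*n, C)
    v_g = vr[routed].reshape(S * S, -1, C)

    # Step 4: dense token-to-token attention over the gathered set only.
    attn = softmax(qr @ k_g.transpose(0, 2, 1) / np.sqrt(C))  # (S*S, n, topk*n)
    out = attn @ v_g                                          # (S*S, n, C)

    # Undo the region partition to recover the (H, W, C) layout.
    h, w = H // S, W // S
    out = out.reshape(S, S, h, w, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    return out, routed

# Usage: an 8x8 map with 4 channels, a 2x2 region grid, 2 routed regions each.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 8, 4)) for _ in range(3))
out, routed = bi_level_routing_attention(q, k, v, S=2, topk=2)
print(out.shape, routed.shape)  # (8, 8, 4) (4, 2)
```

Setting `topk` to the total number of regions makes every query see every token, so the sketch then reduces to ordinary full attention. That gives a convenient sanity check for the gather and reshape logic.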

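Experimental Results

Research Questions

RQ1: Does combining region-level routing with token-level attention reach competitive accuracy at lower computational cost than vanilla attention or static sparse attention?
RQ2: How does BRA scale with input resolution and region partition, and what is the resulting FLOPs/accuracy trade-off?
RQ3: Does BiFormer improve classification, detection, and segmentation performance under similar compute budgets?
RQ4: What qualitative evidence shows that BRA attends to semantically relevant regions for different queries?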

Key Findings

Model	FLOPs (G)	Params (M)	Top-1 accuracy (%)
BiFormer-T	2.2	13.1	81.4
BiFormer-S	4.5	26	83.8
BiFormer-B	9.8	57	84.3
BiFormer-S*	4.5	26	84.3
BiFormer-B*	9.8	58	85.4
Swin-T	4.5	29	81.3
QuadTree-B-b1	2.3	13.6	80.0
QuadTree-B2	4.5	24	82.7
WaveViT-S*	4.7	23	83.9

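Under a suitable region partition, BRA achieves O((HW)^(4/3)) complexity, compared with O((HW)^2) for vanilla attention.
With BRA, BiFormer reaches 81.4% Top-1 on ImageNet-1K at about 2.2 GFLOPs (BiFormer-T) and 83.8% at about 4.5 GFLOPs (BiFormer-S), outperforming several baselines at similar budgets.
On COCO, BiFormer-S and BiFormer-B deliver competitive mAP with RetinaNet and Mask R-CNN pipelines, with the clearest gains on small objects.
On ADE20K, BiFormer-S/B outperform several sparse-attention baselines in mIoU under both Semantic FPN and UperNet.
Ablations confirm that BRA outperforms other sparse attention mechanisms on both classification and segmentation.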
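The O((HW)^(4/3)) figure in the first finding can be checked with a rough cost count. The derivation below is a sketch under assumptions that this review does not state: an S x S region grid, top-k routing, channel dimension C, costs counted in multiply-adds, and the linear projections and softmax ignored. Constant factors depend on that counting convention, so only the exponent should be compared with the paper.

```latex
% Routing: one (S^2 x C)(C x S^2) product for the region affinity graph.
% Token attention: each of the HW queries sees k * HW / S^2 gathered keys,
% once for QK^T and once for the product with V.
\begin{align*}
\mathrm{cost}(S)
  &= \underbrace{S^{4} C}_{\text{routing}}
   + \underbrace{2\,HW \cdot \frac{k\,HW}{S^{2}} \cdot C}_{\text{token attention}} \\
  &= C\left( S^{4} + \frac{k (HW)^{2}}{S^{2}} + \frac{k (HW)^{2}}{S^{2}} \right) \\
  &\ge 3C \left( S^{4} \cdot \frac{k (HW)^{2}}{S^{2}} \cdot \frac{k (HW)^{2}}{S^{2}} \right)^{1/3}
   = 3C\, k^{2/3} (HW)^{4/3}.
\end{align*}
% AM-GM equality holds when the three terms are equal:
\[
S^{4} = \frac{k (HW)^{2}}{S^{2}}
\quad\Longleftrightarrow\quad
S = \left( k (HW)^{2} \right)^{1/6} = k^{1/6} (HW)^{1/3}.
\]
```

In words: a finer grid makes routing more expensive but shrinks the set of gathered keys, and the two costs balance when S grows as (HW)^(1/3). With S fixed instead, the token-attention term stays quadratic in HW, which is why the bound is stated only for a suitable region partition.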

This review was generated by AI and checked by a human editor.