QUICK REVIEW

[Paper Review] Focal Modulation Networks

Jianwei Yang, Chunyuan Li|arXiv (Cornell University)|Mar 22, 2022

Advanced Neural Network Applications | Citations: 148

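One-Sentence Summary

Focal Modulation Networks replace self-attention with a focal modulation module that aggregates multi-scale context through depth-wise convolutions and uses it to modulate each token's query, achieving state-of-the-art accuracy on classification, detection, and segmentation with competitive efficiency.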

ABSTRACT

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretrained on ImageNet-22K in 224 resolution, it attains 86.5% and 87.3% top-1 accuracy when finetuned with resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with 1× outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with 3× schedule (49.0 v.s. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 v.s. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieved 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.

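Motivation and Objectives

Develop an attention-free mechanism for modeling input-dependent, long-range interactions in vision tasks.
Capture short- and long-range visual context through multi-level hierarchical modulation.
Demonstrate improved accuracy and efficiency over state-of-the-art self-attention-based models on classification, detection, and segmentation.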

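Proposed Method

Replaces self-attention with a Focal Modulation module that first aggregates context at multiple focal levels using depth-wise convolutions.
Computes a modulator from the aggregated context through a gated aggregation mechanism, then injects it into the query via an element-wise, affine-like interaction.
Uses two-step context aggregation: (i) hierarchical contextualization with a stack of depth-wise convolutions, and (ii) gated aggregation that forms a per-token modulator.
Defines the output as y_i = q(x_i) ⊙ h(Z_out), where Z_out encodes the multi-level context and gating weights select the contribution of each level.
Incorporates design choices that explicitly preserve nonlinearity (GeLU), translation invariance, and input dependency of the modulation.
Discusses complexity, which is dominated by 3C^2 + C(2L+3) + C∑_ℓ (k^ℓ)^2 (C: channel dimension, L: number of focal levels, k^ℓ: kernel size at level ℓ), enabling efficient, attention-free token interaction.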
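
The method steps above can be sketched in code. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes an (H, W, C) feature map, a single linear projection that yields the query, the initial context, and L+1 gates, a tanh-approximated GeLU, zero padding, a global-average level appended after the L convolutional levels, and illustrative kernel sizes (3, 5). The names `init_params`, `depthwise_conv2d`, `focal_modulation`, and `param_count_formula` are hypothetical helpers introduced here. `param_count_formula` only evaluates the complexity expression quoted in the method list; the sketch's own parameter count is not claimed to match it.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GeLU (an assumption; the exact form may differ).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def depthwise_conv2d(z, kernels):
    """Zero-padded 'same' depth-wise convolution.

    z: (H, W, C) feature map; kernels: (k, k, C) with odd k, one filter per channel.
    """
    k = kernels.shape[0]
    pad = k // 2
    H, W, _ = z.shape
    zp = np.pad(z, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(z)
    for dy in range(k):
        for dx in range(k):
            # Each channel is filtered independently (no cross-channel mixing).
            out += zp[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out


def init_params(C, kernel_sizes, rng, scale=0.1):
    # Hypothetical parameter container for this sketch.
    L = len(kernel_sizes)
    return {
        "W_f": rng.normal(0.0, scale, (C, 2 * C + L + 1)),  # query, context, gates
        "b_f": np.zeros(2 * C + L + 1),
        "dw": [rng.normal(0.0, scale, (k, k, C)) for k in kernel_sizes],
        "W_h": rng.normal(0.0, scale, (C, C)),  # h(.): per-token linear map
        "b_h": np.zeros(C),
    }


def focal_modulation(x, p):
    """Compute y = q(x) * h(Z_out) for an (H, W, C) input."""
    C = x.shape[-1]
    L = len(p["dw"])
    proj = x @ p["W_f"] + p["b_f"]  # (H, W, 2C + L + 1)
    q, z, gates = proj[..., :C], proj[..., C:2 * C], proj[..., 2 * C:]

    # Step (i): hierarchical contextualization, short to long range.
    # Step (ii): gated aggregation, one gate per level and per token.
    z_out = np.zeros_like(z)
    for level, kern in enumerate(p["dw"]):
        z = gelu(depthwise_conv2d(z, kern))
        z_out += z * gates[..., level:level + 1]

    # Assumed extra global level: average over all spatial positions.
    z_global = gelu(z.mean(axis=(0, 1), keepdims=True))
    z_out += z_global * gates[..., L:L + 1]

    modulator = z_out @ p["W_h"] + p["b_h"]
    return q * modulator  # element-wise modulation of the query


def param_count_formula(C, kernel_sizes):
    # Evaluates 3C^2 + C(2L+3) + C * sum_l (k^l)^2 as quoted in the review.
    L = len(kernel_sizes)
    return 3 * C * C + C * (2 * L + 3) + C * sum(k * k for k in kernel_sizes)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
params = init_params(C=4, kernel_sizes=(3, 5), rng=rng)
y = focal_modulation(x, params)
print(y.shape)  # (8, 8, 4)
print(param_count_formula(C=4, kernel_sizes=(3, 5)))
```

In this sketch the gates are used as raw linear outputs, and any output projection or normalization is omitted; these are simplifications chosen for brevity. Because every operation is either per-token or a convolution with shared weights, the cost grows linearly with the number of tokens H×W, which is the property the complexity expression reflects.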

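Experimental Results

Research Questions

RQ1: Can a modulation mechanism without self-attention match or exceed self-attention in vision models for classification, detection, and segmentation?
RQ2: Can multi-scale context aggregation and per-token modulation improve accuracy while maintaining or improving computational efficiency?
RQ3: How does Focal Modulation compare with Swin and Focal Transformers on dense prediction tasks and in large-scale pretraining settings?
RQ4: What qualitative interpretability advantages does focal modulation offer over conventional attention mechanisms?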

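Key Findings

FocalNets outperform state-of-the-art self-attention-based counterparts on image classification, object detection, and segmentation at roughly the same computational cost.
Tiny and base FocalNets achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K; after ImageNet-22K pretraining, accuracy rises to 86.5% and 87.3% top-1 when fine-tuned at 224^2 and 384^2 resolution, respectively.
On COCO object detection with Mask R-CNN, FocalNet base trained with a 1× schedule outperforms its Swin counterpart by 2.1 points and already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5); other detection variants also show competitive results.
On ADE20K semantic segmentation with UPerNet, FocalNet base reaches 50.5 mIoU at single scale, beating Swin at multi-scale (49.7); large FocalNet with Mask2former achieves 58.5 mIoU on ADE20K and 57.9 PQ on COCO Panoptic Segmentation.
Combined with a huge backbone and DINO, FocalNets reach 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, setting a new SOTA over much larger attention-based models such as Swinv2-G and BEIT-3.
Visualizations show that the modulator focuses on the object regions that drive the predicted category, highlighting the interpretability of FocalNets.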

This review was generated by AI and checked by a human editor.