QUICK REVIEW

[Paper Review] Vision Transformer Adapter for Dense Predictions

Zhe Chen, Yuchen Duan | arXiv (Cornell University) | May 17, 2022

Visual Attention and Saliency Detection | Cited by 203

One-line summary

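ViT-Adapter extends a plain Vision Transformer with a lightweight, pre-training-free adapter that injects image priors and reconstructs multi-scale features. It reaches near state-of-the-art performance on dense prediction tasks without changing the ViT architecture.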

ABSTRACT

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.

Motivation and objectives

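Close the performance gap between plain ViT and vision-specific transformers on dense prediction tasks.
Propose a pre-training-free adapter that injects image-related inductive biases without modifying the ViT backbone.
Design three modules that make dense prediction possible: a spatial prior module, a spatial feature injector, and a multi-scale feature extractor.
Demonstrate that ViT-Adapter achieves competitive or superior results on object detection, instance segmentation, and semantic segmentation.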

Proposed method

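ViT-Adapter is introduced as a two-part design: a plain ViT backbone plus a three-module adapter.
The spatial prior module uses a ConvNet stem to build a feature pyramid at three resolutions (1/8, 1/16, and 1/32) from the input image.
The spatial feature injector fuses the spatial priors into the ViT through cross-attention between the ViT tokens and the spatial features.
The multi-scale feature extractor produces hierarchical multi-scale features through cross-attention and FFN operations, yielding the feature pyramid needed for dense prediction.
Interaction with the ViT: the ViT encoder is split into N blocks (N is typically 4). In each block the priors are injected and multi-scale features are then extracted; finally, a 1/4-scale feature map is built by upsampling the 1/8-scale feature, so that the downstream heads receive features at 1/4, 1/8, 1/16, and 1/32 scale.
Deformable attention is the default sparse attention inside the adapter, and an initialization that balances the attention output against the ViT features is used so that the pre-trained ViT weights are preserved.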
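The bookkeeping behind the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pyramid_shapes`, `cross_attention`, `inject`) are hypothetical, dense single-head attention without learned projections stands in for the deformable attention used in the adapter, and the "balanced initialization" is interpreted here as a per-channel gate `gamma` that starts at zero so the injected priors do not disturb the pre-trained ViT features at the start of fine-tuning.

```python
import math

def pyramid_shapes(height, width, strides=(8, 16, 32)):
    """Spatial size (rows, cols) of each spatial-prior level for an input image."""
    return [(height // s, width // s) for s in strides]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Dense single-head cross-attention with no learned projections.

    Assumption: a simple stand-in for deformable attention.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def inject(vit_tokens, prior_tokens, gamma):
    """Gated residual injection: vit + gamma * attention(vit -> priors)."""
    attended = cross_attention(vit_tokens, prior_tokens)
    return [[v + g * a for v, g, a in zip(tok, gamma, att)]
            for tok, att in zip(vit_tokens, attended)]

# Three prior levels for a 512x512 input, flattened into one token sequence.
shapes = pyramid_shapes(512, 512)
num_prior_tokens = sum(rows * cols for rows, cols in shapes)

# Toy 2-channel tokens (hypothetical values, for illustration only).
vit_tokens = [[1.0, 0.0], [0.0, 1.0]]
prior_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = inject(vit_tokens, prior_tokens, gamma=[0.0, 0.0])  # gate closed
changed = inject(vit_tokens, prior_tokens, gamma=[0.1, 0.1])    # gate opened
```

With the gate at zero, `unchanged` equals `vit_tokens`, which is the sense in which the pre-trained weights are preserved; once `gamma` moves away from zero during training, the spatial priors start to shift the ViT features.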

Experimental results

Research questions

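RQ1: Can a pre-training-free adapter introduce vision-specific inductive biases into plain ViT and narrow the gap to vision-specific transformers on dense prediction tasks?
RQ2: How do the spatial priors, cross-attention-based feature injection, and multi-scale feature extraction each contribute to dense prediction performance?
RQ3: Can ViT-Adapter enable a plain ViT backbone to achieve competitive results on object detection, instance segmentation, and semantic segmentation without additional pre-training data?
RQ4: How much does multi-modal pre-training improve ViT-Adapter compared with ImageNet-only pre-training?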

Key findings

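Under ImageNet-1K pre-training, ViT-Adapter consistently improves plain ViT on object detection, instance segmentation, and semantic segmentation in comparisons against vision-specific backbones.
With multi-modal pre-training, ViT-Adapter-L reaches 60.9 box AP and 53.0 mask AP on COCO test-dev without extra detection data.
ViT-Adapter-S outperforms ViT-Det and several vision-specific models under comparable pre-training, indicating that image priors transfer effectively through the adapter.
In semantic segmentation with ImageNet-22K pre-training, ViT-Adapter-B/L are competitive with or better than Swin-B/L and similar models, and multi-modal pre-training brings further gains (e.g., ViT-Adapter-L ★ on ADE20K).
The ablation study confirms that each component (SPM, Spatial Feature Injector, Multi-Scale Feature Extractor) contributes to the gains, and that the full ViT-Adapter improves substantially over the baseline.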


This review was generated by AI and checked by a human editor.