QUICK REVIEW

[Paper Review] Convolutional Bypasses Are Better Vision Transformer Adapters

Shibo Jie, Zhihong Deng|arXiv (Cornell University)|Jul 14, 2022

Domain Adaptation and Few-Shot Learning | Cited by 62

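One-Sentence Summary

Convpass inserts lightweight, trainable convolutional bypasses into ViT. With few trainable parameters, it outperforms language-oriented PETL methods on VTAB-1K and few-shot tasks and shows strong domain generalization.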

ABSTRACT

The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as the size of Vision Transformer (ViT) grows exponentially, the full finetuning becomes prohibitive in view of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) to pretrained ViT and only finetune these modules while the pretrained weights are frozen. However, these modules were originally proposed to finetune language models and did not take into account the prior knowledge specifically for visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small amount (less than 0.5% of model parameters) of trainable parameters to adapt the large ViT. Different from other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and thus is more suitable for visual tasks, especially in the low-data regime. Experimental results on VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity to tailor vision-oriented adaptation modules for adapting vision models.

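Motivation and Objectives

Highlight the mismatch between language-oriented PETL modules and the visual inductive bias that ViT adaptation needs.
Propose Convpass, a vision-oriented PETL module that adds a convolutional inductive bias while leaving the pretrained weights untouched.
Demonstrate the effectiveness of Convpass on VTAB-1K, few-shot learning, and domain generalization settings.
Show that Convpass outperforms prior PETL methods while using fewer trainable parameters.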

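Proposed Method

Propose Convpass, a convolutional bottleneck block inserted in parallel with the ViT blocks that restores the 2D spatial structure of the tokens.
Each Convpass module has three layers: a 1x1 channel reduction, a 3x3 spatial convolution, and a 1x1 channel expansion.
Restore the 2D structure by arranging the patch tokens as a 2D grid and handling the [cls] token as an image of its own.
Freeze the pretrained ViT weights and train only the Convpass modules and the classification head.
Analyze the model through the unraveled view of ViT, which exposes the parallel trainable paths formed by the Convpass modules and the MHSA/MLP blocks.
Compare the vision-oriented Convpass with language-oriented PETL modules (VPT, Adapter, AdaptFormer, LoRA, NOAH).
Evaluate on VTAB-1K with a ViT-B/16 pretrained on ImageNet-21K, plus additional CLIP-based domain generalization experiments.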
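
The token restructuring step described above can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names `split_tokens` and `merge_tokens` are hypothetical, and the sketch assumes a 224x224 input (so ViT-B/16 yields a 14x14 patch grid), row-major patch order, and the [cls] token in position 0. Integers stand in for token vectors.

```python
from math import isqrt


def split_tokens(tokens):
    """Split a ViT token sequence into a 1x1 [cls] "image" and a 2D patch grid.

    Assumes tokens[0] is the [cls] token and the rest are patch tokens
    stored in row-major order (an assumption of this sketch).
    """
    cls_token, patches = tokens[0], tokens[1:]
    side = isqrt(len(patches))
    if side * side != len(patches):
        raise ValueError("patch tokens do not form a square grid")
    grid = [patches[r * side:(r + 1) * side] for r in range(side)]
    # The [cls] token is kept apart as its own 1x1 image.
    return [[cls_token]], grid


def merge_tokens(cls_image, grid):
    """Inverse of split_tokens: flatten back into a token sequence."""
    return [cls_image[0][0]] + [tok for row in grid for tok in row]


# ViT-B/16 on a 224x224 input: 14 * 14 = 196 patch tokens plus [cls].
tokens = list(range(197))
cls_image, grid = split_tokens(tokens)
restored = merge_tokens(cls_image, grid)
```

In a real model, a convolution would be applied to the grid (and to the 1x1 [cls] image) between the split and the merge, so that each patch token is updated from its spatial neighbors rather than treated as an unordered sequence element.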

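Experimental Results

Research Questions

RQ1: Can vision-oriented adaptation modules outperform language-oriented PETL modules when fine-tuning ViT on visual tasks?
RQ2: Does the convolutional inductive bias introduced by Convpass improve data efficiency, especially in low-data regimes (few-shot learning and the VTAB-1K subsets)?
RQ3: How does Convpass affect domain generalization, including with vision-language models such as CLIP, compared with several baseline PETL methods?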

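Key Findings

Convpass_attn (a Convpass module placed alongside MHSA only) and Convpass (modules in parallel with both MHSA and MLP) both perform strongly on VTAB-1K, and Convpass obtains the best average result among the PETL methods.
Convpass_attn achieves state-of-the-art results on 12 of the 19 VTAB-1K tasks, while the full Convpass obtains the best average performance, 1.1 points above the previous SOTA (NOAH) across all VTAB-1K tasks.
On ViT-B/16 (86M-parameter backbone), Convpass introduces about 0.33M trainable parameters, far fewer than full fine-tuning, while achieving higher accuracy.
Convpass delivers strong few-shot gains on five fine-grained datasets, outperforming the baselines in most shot settings, which indicates improved data efficiency.
In the CLIP-based domain generalization experiments, Convpass_CLIP outperforms the PETL baselines designed for CLIP on the source domain and on most target domains, showing robustness to domain shift.
Even compared with backbones that have a built-in visual inductive bias (Swin, ConvNeXt), ViT with Convpass can outperform full fine-tuning of those backbones, indicating that Convpass effectively compensates for the visual inductive bias that ViT lacks.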
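
The trainable-parameter figure above can be sanity-checked with a short back-of-the-envelope calculation. This is a sketch under stated assumptions, not a count taken from the authors' code: the bottleneck width of 8 is not stated in this review and is assumed here, every layer is assumed to carry a bias, one module is assumed beside each MHSA and each MLP block, and the classification head is excluded. The helper name `convpass_module_params` is hypothetical.

```python
def convpass_module_params(dim, hidden, kernel=3):
    """Count the parameters of one 1x1 -> 3x3 -> 1x1 bottleneck, with biases."""
    down = dim * hidden + hidden                           # 1x1 channel reduction
    spatial = hidden * hidden * kernel * kernel + hidden   # 3x3 spatial convolution
    up = hidden * dim + dim                                # 1x1 channel expansion
    return down + spatial + up


DIM = 768              # ViT-B/16 embedding width
HIDDEN = 8             # ASSUMED bottleneck width (not given in this review)
LAYERS = 12            # ViT-B/16 depth
MODULES_PER_LAYER = 2  # ASSUMED: one beside MHSA, one beside MLP
BACKBONE = 86_000_000  # backbone size quoted in the review

per_module = convpass_module_params(DIM, HIDDEN)
total = per_module * MODULES_PER_LAYER * LAYERS
fraction = total / BACKBONE

print(f"per module: {per_module}, total: {total}, share of backbone: {fraction:.2%}")
```

Under these assumptions the total comes to 327,552 parameters, which is consistent with the roughly 0.33M quoted above and with the abstract's "less than 0.5% of model parameters".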


This review was generated by AI and checked by a human editor.