QUICK REVIEW

[Paper Review] Exploring Efficient Few-shot Adaptation for Vision Transformers

Chengming Xu, Siqian Yang|arXiv (Cornell University)|Jan 6, 2023

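Domain Adaptation and Few-Shot Learning | Citations: 8

One-sentence summary

This paper introduces efficient Transformer Tuning (eTT), a method for few-shot learning with Vision Transformers that combines Attentive Prefix Tuning (APT) and a Domain Residual Adapter (DRA), achieving strong performance on Meta-Dataset while keeping the number of trainable parameters small.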

ABSTRACT

The task of Few-shot Learning (FSL) aims to do the inference on novel categories containing only few labeled examples, with the help of knowledge learned from base categories containing abundant labeled training samples. While there are numerous works into FSL task, Vision Transformers (ViTs) have rarely been taken as the backbone to FSL with few trials focusing on naive finetuning of whole backbone or classification layer. Essentially, despite ViTs have been shown to enjoy comparable or even better performance on other vision tasks, it is still very nontrivial to efficiently finetune the ViTs in real-world FSL scenarios. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the FSL tasks. The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA) for the task and backbone tuning, individually. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure the APT would not deviate from the initial task-specific information much, we further propose a novel prototypical regularization, which maximizes the similarity between the projected distribution of prefix and initial prototypes, regularizing the update procedure. Our method receives outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model.
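The abstract's description of APT (a prefix projected to extra key/value pairs that are attached to self-attention) can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the function name `prefix_attention`, the separate prefix projections `Pk` and `Pv`, and the toy 2-dimensional inputs are all assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, W):
    """Multiply a row vector x (d_in) by a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def prefix_attention(tokens, prefix, Wq, Wk, Wv, Pk, Pv):
    """Single-head self-attention with prefix keys/values prepended.

    Pk and Pv are hypothetical projections that map prefix vectors to
    keys and values; the backbone weights Wq, Wk, Wv are left untouched.
    """
    queries = [linear(t, Wq) for t in tokens]
    keys = [linear(p, Pk) for p in prefix] + [linear(t, Wk) for t in tokens]
    values = [linear(p, Pv) for p in prefix] + [linear(t, Wv) for t in tokens]
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)  # attention over prefix slots and image tokens
        weights.append(w)
        outputs.append([sum(wi * v[c] for wi, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy example: identity projections, two tokens, one prefix vector.
I2 = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
prefix = [[1.0, 1.0]]
out, w = prefix_attention(tokens, prefix, I2, I2, I2, I2, I2)
print(len(w[0]))  # each query attends to len(prefix) + len(tokens) slots
```

The output sequence keeps the same length as the input tokens; only the set of keys and values each query can attend to grows, which is how the prefix can inject task-specific information without changing the backbone weights.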

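Motivation and objectives

Propose efficient fine-tuning for ViT backbones that avoids the high computational cost and overfitting of baseline approaches.
Propose a lightweight tuning framework (eTT) that adapts ViTs to novel tasks with limited data.
Introduce a prototypical regularization that preserves task-specific knowledge while the prefix is updated.
Show state-of-the-art or competitive FSL performance with a ViT backbone on large-scale FSL benchmarks such as Meta-Dataset, without additional data.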

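Proposed method

Pretrain the ViT on base data with self-supervised DINO to obtain robust representations.
Use Attentive Prefix Tuning (APT): initialize a task-specific visual prefix from attentive prototypes and add learned key/value pairs to each self-attention layer.
Introduce the Domain Residual Adapter (DRA): learn a small domain offset vector for each Transformer layer to bridge the gap between base and novel domains.
Apply prototypical regularization: keep prefix updates consistent with the initial prototypes by matching their projected distributions.
Fine-tune with a combination of cross-entropy loss and the distillation-based prototypical regularization (L = L_CE + lambda L_dist).
Report efficiency: with ViT-small/ViT-tiny, the parameters trained during fine-tuning amount to about 9% of the total.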
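The DRA offset and the combined objective L = L_CE + lambda L_dist can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: where exactly the offset is added inside a layer, the use of a KL divergence as a stand-in for the distillation term, and the value of `lam` are all hypothetical choices.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def add_offset(hidden, offset):
    """DRA-style adaptation: add one learnable offset vector to every token.

    Assumption: the offset is simply added to the hidden features of a layer.
    """
    return [[h + o for h, o in zip(token, offset)] for token in hidden]

def cross_entropy(logits, label):
    """Standard classification loss on the query prediction (L_CE)."""
    return -math.log(softmax(logits)[label])

def kl_divergence(target_logits, pred_logits):
    """Stand-in for L_dist: KL between two softmax distributions.

    Assumption: the prototype distribution is the fixed target and the
    prefix distribution is the one being regularized.
    """
    p = softmax(target_logits)
    q = softmax(pred_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(logits, label, proto_logits, prefix_logits, lam):
    """L = L_CE + lambda * L_dist, with lam a hypothetical weight."""
    return cross_entropy(logits, label) + lam * kl_divergence(proto_logits, prefix_logits)

shifted = add_offset([[1.0, 2.0], [3.0, 4.0]], [0.5, -1.0])
loss = total_loss([2.0, 0.0], 0, [1.0, 0.0], [0.0, 1.0], lam=0.1)
print(shifted, loss)
```

When the prefix distribution matches the prototype distribution the regularizer is zero and the objective reduces to plain cross-entropy, which is the sense in which the term discourages the prefix from drifting away from its initialization.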

Figure 1: (a) Comparing with other backbones, we propose the Domain Residual Adapter (DRA) to tune much less parameters in our efficient Transformer Tuning (eTT); and effective for large-scale FSL. (b) The few-shot support samples are first processed into attentive prototypes which are used to initi

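Experimental results

Research questions

RQ1: How can a ViT-based backbone be fine-tuned efficiently without large-scale retraining?
RQ2: Can attentive prototypes and lightweight adapters improve task adaptation across domain shifts in FSL?
RQ3: Does prototypical regularization help retain task-specific information during rapid fine-tuning?
RQ4: How does the empirical performance of eTT on large-scale FSL benchmarks such as Meta-Dataset compare with ResNet-based methods?

Key findings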

Model	ILSVRC	Omni	Acraft	CUB	DTD	QDraw	Fungi	Flower	Sign	COCO	Avg
Proto	63.37	65.86	45.11	72.01	83.50	60.88	51.02	92.39	49.23	54.99	63.84
LT+NCC	65.96	67.62	64.03	77.10	83.46	63.88	57.79	93.13	66.91	56.04	69.59
Last	66.32	71.04	78.04	86.25	86.67	64.22	55.69	94.44	65.55	55.94	72.42
First	61.54	50.46	69.23	79.17	83.10	68.69	49.93	93.50	54.28	58.45	66.84
LN	66.22	70.45	69.41	81.29	86.37	66.28	58.38	96.25	71.09	59.57	72.53
APT	66.75	75.16	75.41	84.25	86.47	69.55	60.03	96.38	78.20	61.10	75.33
Adapter	66.53	72.31	73.75	83.73	86.86	66.74	58.49	96.15	82.65	62.40	74.93
eTT	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Random	66.12	76.33	78.35	84.77	86.78	70.13	59.25	96.00	82.28	59.59	75.96
Sampling	67.81	76.72	77.96	85.79	87.25	70.19	60.73	96.27	83.72	62.17	76.86
Full	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Linear	66.35	74.26	79.42	83.65	86.02	71.11	55.73	95.89	82.73	59.90	75.51
Bottleneck	67.29	76.06	79.72	85.60	87.21	70.59	61.59	96.15	85.00	62.02	77.12
FiLM	66.91	75.32	78.26	85.78	86.83	70.29	61.65	96.50	84.48	61.75	76.78
Offset	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
w/o PR	66.72	74.20	78.42	85.06	87.01	70.34	61.64	96.51	84.23	61.08	76.52
w PR	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
w/o Stand	67.09	76.42	78.87	83.10	86.50	70.09	61.02	96.33	82.88	61.33	76.36
w Stand	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Avg	66.11	75.06	77.07	85.16	87.35	70.72	61.79	96.54	84.28	62.18	76.73
Sampling (alt)	67.81	76.72	77.96	85.79	87.25	70.19	60.73	96.27	83.72	62.17	76.86
Full (alt)	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61

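eTT with ViT-small performs strongly on Meta-Dataset, with competitive average ranks (e.g., 4.1 for the ViT-tiny and ViT-small configurations, and an average rank of 1.6 for ViT-s in some settings).
On several datasets, eTT outperforms the strong baselines CTX and TSA by notable margins (roughly 8% to 10%).
eTT is highly efficient: in the ViT-small configuration, the parameters learned during fine-tuning stay at about 9% of the total.
With only self-supervised DINO pretraining and no additional training data, FSL generalization on Meta-Dataset is favorable.
By combining Attentive Prefix Tuning with Domain Residual Adapters, the method remains robust even under large domain gaps.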
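The Avg column of the results table above can be re-derived from the per-dataset accuracies, which also makes the size of each ablation gap explicit. The snippet below copies four rows from that table (Proto, APT, eTT, and w/o PR) and compares the recomputed means with the reported ones; the variable names are illustrative only.

```python
# Per-dataset accuracies copied from the table above, in column order:
# ILSVRC, Omni, Acraft, CUB, DTD, QDraw, Fungi, Flower, Sign, COCO
rows = {
    "Proto":  [63.37, 65.86, 45.11, 72.01, 83.50, 60.88, 51.02, 92.39, 49.23, 54.99],
    "APT":    [66.75, 75.16, 75.41, 84.25, 86.47, 69.55, 60.03, 96.38, 78.20, 61.10],
    "eTT":    [67.37, 78.11, 79.94, 85.93, 87.62, 71.34, 61.80, 96.57, 85.09, 62.33],
    "w/o PR": [66.72, 74.20, 78.42, 85.06, 87.01, 70.34, 61.64, 96.51, 84.23, 61.08],
}
# Avg values as reported in the same table.
reported_avg = {"Proto": 63.84, "APT": 75.33, "eTT": 77.61, "w/o PR": 76.52}

def mean(values):
    return sum(values) / len(values)

computed_avg = {name: mean(values) for name, values in rows.items()}

for name in rows:
    print(f"{name}: computed {computed_avg[name]:.2f}, reported {reported_avg[name]:.2f}")

# Average-accuracy gaps between the full model and two ablations.
gain_over_apt = computed_avg["eTT"] - computed_avg["APT"]
gain_from_pr = computed_avg["eTT"] - computed_avg["w/o PR"]
print(f"eTT vs APT only: +{gain_over_apt:.2f}")
print(f"with vs without PR: +{gain_from_pr:.2f}")
```

For these four rows the recomputed means agree with the reported Avg values to two decimals, and the full eTT row is about 2.3 points above APT alone and about 1.1 points above the variant without prototypical regularization.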

Figure 2: Schematic illustration of our proposed model. For each image, we first fetch its patch embedding sequence and the attention score with regard to the last layer’s class token, from which the image embedding can be computed. Then the visual prefix is initialized as the attentive prototypes o
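The first steps named in the Figure 2 caption (patch embeddings weighted by the class token's attention scores to form an image embedding, then prototypes used to initialize the prefix) might look like the sketch below. Because the caption is truncated, the per-class averaging step and the normalization of the attention scores are assumptions, as are the function names.

```python
def attentive_embedding(patches, cls_attention):
    """Attention-weighted sum of one image's patch embeddings.

    Assumption: the class-token attention scores are renormalized to sum to 1.
    """
    total = sum(cls_attention)
    weights = [a / total for a in cls_attention]
    dim = len(patches[0])
    return [sum(w * p[c] for w, p in zip(weights, patches)) for c in range(dim)]

def attentive_prototypes(support):
    """Average attentive embeddings per class.

    support: list of (label, patches, cls_attention) tuples.
    Assumption: a prototype is the per-class mean of image embeddings.
    """
    by_class = {}
    for label, patches, attention in support:
        by_class.setdefault(label, []).append(attentive_embedding(patches, attention))
    return {
        label: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for label, embeddings in by_class.items()
    }

# Toy support set: two images of class "a", each with two 2-d patches.
patches = [[1.0, 0.0], [0.0, 1.0]]
support = [("a", patches, [0.75, 0.25]), ("a", patches, [0.25, 0.75])]
prototypes = attentive_prototypes(support)
print(prototypes)
```

In this toy case the two images emphasize different patches, so their embeddings differ and the class prototype lands halfway between them.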


This review was generated by AI and checked by a human editor.