QUICK REVIEW

[Paper Review] Exploring Efficient Few-shot Adaptation for Vision Transformers

Chengming Xu, Siqian Yang | arXiv (Cornell University) | Jan 6, 2023

Domain Adaptation and Few-Shot Learning | Cited by 8

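One-Sentence Summary

This paper proposes efficient Transformer Tuning (eTT), an efficient fine-tuning method for few-shot learning with Vision Transformers. Using Attentive Prefix Tuning (APT) and a Domain Residual Adapter (DRA), it achieves strong performance on Meta-Dataset with only a small number of trainable parameters.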

ABSTRACT

The task of Few-shot Learning (FSL) aims to do the inference on novel categories containing only few labeled examples, with the help of knowledge learned from base categories containing abundant labeled training samples. While there are numerous works into FSL task, Vision Transformers (ViTs) have rarely been taken as the backbone to FSL with few trials focusing on naive finetuning of whole backbone or classification layer. Essentially, despite ViTs have been shown to enjoy comparable or even better performance on other vision tasks, it is still very nontrivial to efficiently finetune the ViTs in real-world FSL scenarios. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the FSL tasks. The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA) for the task and backbone tuning, individually. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure the APT would not deviate from the initial task-specific information much, we further propose a novel prototypical regularization, which maximizes the similarity between the projected distribution of prefix and initial prototypes, regularizing the update procedure. Our method receives outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model.
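
The APT idea in the abstract (a prefix projected to extra key and value pairs that are attached to a self-attention layer) can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the function names, the single-head setup, and the toy projection matrices `w_key` and `w_value` are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(weight, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def prefix_attention(queries, keys, values, prefix, w_key, w_value):
    """Project each prefix vector to a key/value pair and prepend the pairs.

    Assumption: w_key and w_value are hypothetical learnable projections;
    the queries are left untouched, so the sequence length of the output
    does not change.
    """
    prefix_keys = [project(w_key, p) for p in prefix]
    prefix_values = [project(w_value, p) for p in prefix]
    return attention(queries, prefix_keys + keys, prefix_values + values)

# Toy example: two tokens, one prefix vector, identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0]]
prefix = [[2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
plain = attention(tokens, tokens, tokens)
with_prefix = prefix_attention(tokens, tokens, tokens, prefix, identity, identity)
```

In this sketch the tokens can attend to the prefix, so `with_prefix` differs from `plain` even though no backbone weight was changed. That is the sense in which a prefix can inject task-specific information.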

Research Motivation and Goals

Motivate efficient ViT-based fine-tuning for few-shot learning (FSL), since naively fine-tuning the whole backbone is computationally heavy and prone to overfitting on few labeled samples.
Propose a lightweight tuning framework (eTT) that combines APT and DRA to adapt ViTs to novel tasks with limited data.
Introduce prototypical regularization to preserve task-specific knowledge during prefix updates.
Demonstrate state-of-the-art or competitive FSL performance on Meta-Dataset using ViT backbones without extra data for training.

Proposed Method

Pretrain ViT with self-supervised DINO on base data to obtain robust representations.
Use Attentive Prefix Tuning (APT): initialize a task-specific visual prefix from attentive prototypes and attach learned key/value pairs to each self-attention layer.
Introduce Domain Residual Adapter (DRA): learn small domain-offset vectors per Transformer layer to bridge base/novel domain gaps.
Apply prototypical regularization to keep prefix updates aligned with initial prototypes by matching projected distributions.
Fine-tune with a combination of cross-entropy loss and a distillation-based prototypical regularization (L = L_CE + lambda L_dist).
Report efficiency: trainable parameters during finetuning amount to ~9% of the full ViT when using ViT-small/ViT-tiny.
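
The DRA and the combined objective listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: where the offset is added inside a layer, the exact form of `L_dist` (written here as a cross-entropy between two softmax distributions), and the value of `lam` are all hypothetical choices.

```python
import math

def add_offset(tokens, offset):
    """DRA-style sketch: add one learnable offset vector to every token.

    Assumption: one offset per Transformer layer, applied to that layer's
    token features; the exact placement is not given in this review.
    """
    return [[t + o for t, o in zip(token, offset)] for token in tokens]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Classification loss L_CE for one sample."""
    return -math.log(softmax(logits)[label])

def distill_loss(prefix_scores, prototype_scores):
    """Assumed form of L_dist: cross-entropy from the prototype distribution
    (the fixed target) to the distribution of the projected prefix."""
    target = softmax(prototype_scores)
    current = softmax(prefix_scores)
    return -sum(t * math.log(c) for t, c in zip(target, current))

def total_loss(logits, label, prefix_scores, prototype_scores, lam=0.1):
    """L = L_CE + lambda * L_dist; lam=0.1 is a placeholder, not a reported value."""
    return cross_entropy(logits, label) + lam * distill_loss(prefix_scores, prototype_scores)

# Toy usage: shift two tokens, then evaluate the combined loss.
shifted = add_offset([[1.0, 2.0], [3.0, 4.0]], [0.5, -1.0])
loss = total_loss([2.0, 0.5], 0, [1.0, 2.0], [1.2, 1.9])
```

Only the offset vectors and the prefix would be trained in such a setup, which is consistent with the small share of trainable parameters reported above.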

Figure 1: (a) Comparing with other backbones, we propose the Domain Residual Adapter (DRA) to tune much less parameters in our efficient Transformer Tuning (eTT); and effective for large-scale FSL. (b) The few-shot support samples are first processed into attentive prototypes which are used to initi

Experimental Results

Research Questions

RQ1: How can ViT-based backbones be efficiently fine-tuned for few-shot learning without extensive retraining?
RQ2: Can attentive prototypes and lightweight adapters improve task adaptation across domain shifts in FSL?
RQ3: Does a prototypical regularization help preserve task-specific information during rapid fine-tuning?
RQ4: What is the empirical performance of eTT on large-scale FSL benchmarks like Meta-Dataset compared to ResNet-based methods?

Key Findings

Model	ILSVRC	Omniglot	Aircraft	CUB	DTD	QuickDraw	Fungi	Flower	Traffic Sign	COCO	Avg
Proto	63.37	65.86	45.11	72.01	83.50	60.88	51.02	92.39	49.23	54.99	63.84
LT+NCC	65.96	67.62	64.03	77.10	83.46	63.88	57.79	93.13	66.91	56.04	69.59
Last	66.32	71.04	78.04	86.25	86.67	64.22	55.69	94.44	65.55	55.94	72.42
First	61.54	50.46	69.23	79.17	83.10	68.69	49.93	93.50	54.28	58.45	66.84
LN	66.22	70.45	69.41	81.29	86.37	66.28	58.38	96.25	71.09	59.57	72.53
APT	66.75	75.16	75.41	84.25	86.47	69.55	60.03	96.38	78.20	61.10	75.33
Adapter	66.53	72.31	73.75	83.73	86.86	66.74	58.49	96.15	82.65	62.40	74.93
eTT	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Random	66.12	76.33	78.35	84.77	86.78	70.13	59.25	96.00	82.28	59.59	75.96
Sampling	67.81	76.72	77.96	85.79	87.25	70.19	60.73	96.27	83.72	62.17	76.86
Full	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Linear	66.35	74.26	79.42	83.65	86.02	71.11	55.73	95.89	82.73	59.90	75.51
Bottleneck	67.29	76.06	79.72	85.60	87.21	70.59	61.59	96.15	85.00	62.02	77.12
FiLM	66.91	75.32	78.26	85.78	86.83	70.29	61.65	96.50	84.48	61.75	76.78
Offset	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
w/o PR	66.72	74.20	78.42	85.06	87.01	70.34	61.64	96.51	84.23	61.08	76.52
w PR	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
w/o Stand	67.09	76.42	78.87	83.10	86.50	70.09	61.02	96.33	82.88	61.33	76.36
w Stand	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
Avg	66.11	75.06	77.07	85.16	87.35	70.72	61.79	96.54	84.28	62.18	76.73
Sampling (alt)	67.81	76.72	77.96	85.79	87.25	70.19	60.73	96.27	83.72	62.17	76.86
Full (alt)	67.37	78.11	79.94	85.93	87.62	71.34	61.80	96.57	85.09	62.33	77.61
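
The Avg column can be reproduced as the unweighted mean of the ten per-dataset accuracies. The short sketch below recomputes it for three rows copied from the table; the helper name `mean_accuracy` is ours, and the rounding to two decimals is an assumption about how the column was produced.

```python
# Per-dataset accuracies copied from the table, in column order
# (ILSVRC ... COCO), without the Avg column.
rows = {
    "Proto": [63.37, 65.86, 45.11, 72.01, 83.50, 60.88, 51.02, 92.39, 49.23, 54.99],
    "APT":   [66.75, 75.16, 75.41, 84.25, 86.47, 69.55, 60.03, 96.38, 78.20, 61.10],
    "eTT":   [67.37, 78.11, 79.94, 85.93, 87.62, 71.34, 61.80, 96.57, 85.09, 62.33],
}

def mean_accuracy(accuracies):
    """Unweighted mean over datasets, rounded to two decimals."""
    return round(sum(accuracies) / len(accuracies), 2)

averages = {name: mean_accuracy(values) for name, values in rows.items()}

# Gain of the full method over prefix tuning alone, in accuracy points.
gain_over_apt = round(averages["eTT"] - averages["APT"], 2)
```

For these rows the recomputed means agree with the listed Avg values, and the gap between eTT and APT alone comes to a little over two points.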

eTT performs strongly on Meta-Dataset with ViT backbones and reaches competitive average ranks (e.g., 4.1 with ViT-tiny and ViT-small configurations; 1.6 with ViT-small in some setups).
On several datasets (e.g., Texture and Fungi), eTT outperforms the strong baselines CTX and TSA by notable margins (approximately 8% and 10%).
eTT attains high efficiency, with learnable parameters during finetuning around 9% of the full ViT model for ViT-small configurations.
Using self-supervised DINO pretraining without extra training data yields favorable generalization for FSL on Meta-Dataset.
The method maintains robustness across large domain gaps by combining Attentive Prefix Tuning and Domain Residual Adapters.

Figure 2: Schematic illustration of our proposed model. For each image, we first fetch its patch embedding sequence and the attention score with regard to the last layer’s class token, from which the image embedding can be computed. Then the visual prefix is initialized as the attentive prototypes o
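
The first steps named in the Figure 2 caption (an image embedding computed from patch embeddings and class-token attention scores, then prototypes built from support images) might look like the sketch below. The normalization of the attention scores and the per-class averaging are assumptions made for illustration; the review does not spell out either step.

```python
def image_embedding(patch_embeddings, cls_attention):
    """Attention-weighted sum of patch embeddings for one image.

    Assumption: the class-token attention scores are renormalized to sum to 1.
    """
    total = sum(cls_attention)
    weights = [a / total for a in cls_attention]
    dim = len(patch_embeddings[0])
    return [sum(w * patch[j] for w, patch in zip(weights, patch_embeddings))
            for j in range(dim)]

def class_prototypes(embeddings, labels):
    """Average the image embeddings of each support class (assumed aggregation)."""
    grouped = {}
    for embedding, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(embedding)
    return {label: [sum(column) / len(column) for column in zip(*members)]
            for label, members in grouped.items()}

# Toy support set: three images, two classes, 2-d embeddings.
embedding = image_embedding([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
prototypes = class_prototypes([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]], ["a", "a", "b"])
```

Prototypes computed this way would serve as the starting point for the visual prefix, which is then updated during fine-tuning.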


This review was generated by AI and checked by a human editor.