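[Paper Review] Exploring Efficient Few-shot Adaptation for Vision Transformers
This paper proposes an efficient Transformer Tuning method (eTT) for few-shot learning with Vision Transformers. Using Attentive Prefix Tuning (APT) and a Domain Residual Adapter (DRA), it achieves strong performance on Meta-Dataset with only a small number of trainable parameters.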
The task of Few-shot Learning (FSL) is to perform inference on novel categories that contain only a few labeled examples, using knowledge learned from base categories that contain abundant labeled training samples. While there are numerous works on the FSL task, Vision Transformers (ViTs) have rarely been used as the backbone for FSL, and the few attempts so far rely on naive finetuning of the whole backbone or of the classification layer. Although ViTs have been shown to achieve comparable or even better performance on other vision tasks, efficiently finetuning them in real-world FSL scenarios remains nontrivial. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in FSL tasks. The key novelties are the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA), for task tuning and backbone tuning, respectively. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure that the APT does not deviate much from the initial task-specific information, we further propose a novel prototypical regularization, which maximizes the similarity between the projected distributions of the prefix and the initial prototypes, regularizing the update procedure. Our method achieves outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model.
Research Motivation and Objectives
- Motivate efficient ViT-based fine-tuning for few-shot learning (FSL) to avoid the heavy computation and overfitting that come with naively fine-tuning the whole backbone.
- Propose a light-weight tuning framework (eTT) combining APT and DRA to adapt ViTs to novel tasks with limited data.
- Introduce prototypical regularization to preserve task-specific knowledge during prefix updates.
- Demonstrate state-of-the-art or competitive FSL performance on Meta-Dataset using ViT backbones without extra data for training.
Proposed Method
- Pretrain ViT with self-supervised DINO on base data to obtain robust representations.
- Use Attentive Prefix Tuning (APT): initialize a task-specific visual prefix from attentive prototypes and attach learned key/value pairs to each self-attention layer.
- Introduce Domain Residual Adapter (DRA): learn small domain-offset vectors per Transformer layer to bridge base/novel domain gaps.
- Apply prototypical regularization to keep prefix updates aligned with initial prototypes by matching projected distributions.
- Fine-tune with a combination of cross-entropy loss and a distillation-based prototypical regularization (L = L_CE + lambda L_dist).
- Report efficiency: trainable parameters during finetuning amount to ~9% of the full ViT when using ViT-small/ViT-tiny.
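
The APT and DRA steps above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes a single attention head, treats the prefix as already projected to key/value pairs, and applies the offset vector to the attention output (the exact insertion point of the offset is an assumption here). The names `prefix_attention` and `add_domain_offset` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prefix_attention(queries, keys, values, prefix_keys, prefix_values):
    """Single-head attention with extra prefix key/value pairs prepended.

    Assumption: prefix_keys / prefix_values are the prefix after projection.
    """
    all_keys = prefix_keys + keys
    all_values = prefix_values + values
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(all_values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in all_keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, all_values)) for d in range(dim)]
        )
    return outputs

def add_domain_offset(tokens, offset):
    """DRA-style step (assumed placement): add one learnable offset vector to every token."""
    return [[t + o for t, o in zip(token, offset)] for token in tokens]

# Toy example: 2 image tokens, 1 prefix token, feature dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
prefix_k = [[1.0, 1.0]]
prefix_v = [[0.5, 0.5]]
attended = prefix_attention(tokens, tokens, tokens, prefix_k, prefix_v)
adapted = add_domain_offset(attended, [0.1, -0.1])
```

In this sketch only `prefix_k`, `prefix_v`, and the offset vector would be trained, while the backbone weights that produce the queries, keys, and values stay frozen, which is what keeps the number of trainable parameters small.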

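The combined objective L = L_CE + lambda L_dist listed in the method can be sketched as follows. The summary does not spell out the exact form of L_dist, so this sketch assumes a distillation-style cross-entropy between the softmax distribution of the initial prototypes (teacher) and that of the projected prefix (student). The default `lam=0.1` is a placeholder, not a value from the paper, and all function names are hypothetical.

```python
import math

def log_softmax(xs):
    # Numerically stable log-softmax over a list of floats.
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]

def cross_entropy(logits, label):
    # Standard classification loss for one sample.
    return -log_softmax(logits)[label]

def distillation_term(prefix_logits, prototype_logits):
    """Assumed form of L_dist: H(teacher, student) over softmax distributions."""
    teacher = [math.exp(v) for v in log_softmax(prototype_logits)]
    student_log = log_softmax(prefix_logits)
    return -sum(t * s for t, s in zip(teacher, student_log))

def total_loss(class_logits, label, prefix_logits, prototype_logits, lam=0.1):
    # L = L_CE + lambda * L_dist
    return cross_entropy(class_logits, label) + lam * distillation_term(
        prefix_logits, prototype_logits
    )

# Toy usage: a prefix that drifted away from its initial prototype is penalized more.
prototype = [2.0, 0.0, -1.0]
drifted_prefix = [0.0, 2.0, -1.0]
loss_aligned = total_loss([1.0, 0.0], 0, prototype, prototype)
loss_drifted = total_loss([1.0, 0.0], 0, drifted_prefix, prototype)
```

Under this assumed form, the regularizer is smallest when the prefix distribution matches the prototype distribution, so `loss_drifted` is larger than `loss_aligned`.
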
Experimental Results
Research Questions
- RQ1: How can ViT-based backbones be efficiently fine-tuned for few-shot learning without extensive retraining?
- RQ2: Can attentive prototypes and lightweight adapters improve task adaptation across domain shifts in FSL?
- RQ3: Does a prototypical regularization help preserve task-specific information during rapid fine-tuning?
- RQ4: What is the empirical performance of eTT on large-scale FSL benchmarks like Meta-Dataset compared to ResNet-based methods?
Key Findings
| Model | ILSVRC | Omni | Acraft | CUB | DTD | QDraw | Fungi | Flower | Sign | COCO | Avg | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proto | 63.37 | 65.86 | 45.11 | 72.01 | 83.50 | 60.88 | 51.02 | 92.39 | 49.23 | 54.99 | 63.84 | |
| LT+NCC | 65.96 | 67.62 | 64.03 | 77.10 | 83.46 | 63.88 | 57.79 | 93.13 | 66.91 | 56.04 | 69.59 | |
| Last | 66.32 | 71.04 | 78.04 | 86.25 | 86.67 | 64.22 | 55.69 | 94.44 | 65.55 | 55.94 | 72.42 | |
| First | 61.54 | 50.46 | 69.23 | 79.17 | 83.10 | 68.69 | 49.93 | 93.50 | 54.28 | 58.45 | 66.84 | |
| LN | 66.22 | 70.45 | 69.41 | 81.29 | 86.37 | 66.28 | 58.38 | 96.25 | 71.09 | 59.57 | 72.53 | |
| APT | 66.75 | 75.16 | 75.41 | 84.25 | 86.47 | 69.55 | 60.03 | 96.38 | 78.20 | 61.10 | 75.33 | |
| Adapter | 66.53 | 72.31 | 73.75 | 83.73 | 86.86 | 66.74 | 58.49 | 96.15 | 82.65 | 62.40 | 74.93 | |
| eTT | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
| Random | 66.12 | 76.33 | 78.35 | 84.77 | 86.78 | 70.13 | 59.25 | 96.00 | 82.28 | 59.59 | 75.96 | |
| Sampling | 67.81 | 76.72 | 77.96 | 85.79 | 87.25 | 70.19 | 60.73 | 96.27 | 83.72 | 62.17 | 76.86 | |
| Full | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
| Linear | 66.35 | 74.26 | 79.42 | 83.65 | 86.02 | 71.11 | 55.73 | 95.89 | 82.73 | 59.90 | 75.51 | |
| Bottleneck | 67.29 | 76.06 | 79.72 | 85.60 | 87.21 | 70.59 | 61.59 | 96.15 | 85.00 | 62.02 | 77.12 | |
| FiLM | 66.91 | 75.32 | 78.26 | 85.78 | 86.83 | 70.29 | 61.65 | 96.50 | 84.48 | 61.75 | 76.78 | |
| Offset | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
| w/o PR | 66.72 | 74.20 | 78.42 | 85.06 | 87.01 | 70.34 | 61.64 | 96.51 | 84.23 | 61.08 | 76.52 | |
| w PR | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
| w/o Stand | 67.09 | 76.42 | 78.87 | 83.10 | 86.50 | 70.09 | 61.02 | 96.33 | 82.88 | 61.33 | 76.36 | |
| w Stand | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
| Avg | 66.11 | 75.06 | 77.07 | 85.16 | 87.35 | 70.72 | 61.79 | 96.54 | 84.28 | 62.18 | 76.73 | |
| Sampling (alt) | 67.81 | 76.72 | 77.96 | 85.79 | 87.25 | 70.19 | 60.73 | 96.27 | 83.72 | 62.17 | 76.86 | |
| Full (alt) | 67.37 | 78.11 | 79.94 | 85.93 | 87.62 | 71.34 | 61.80 | 96.57 | 85.09 | 62.33 | 77.61 | |
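
As a quick sanity check on the table, the Avg column can be recomputed as the unweighted mean of the ten per-dataset accuracies. The snippet below does this for the `w PR` and `w/o PR` rows (values copied from the table) and derives the average gain attributed to prototypical regularization. The helper name `mean_accuracy` is hypothetical.

```python
# Per-dataset accuracies copied from the table, in column order ILSVRC ... COCO.
ett = [67.37, 78.11, 79.94, 85.93, 87.62, 71.34, 61.80, 96.57, 85.09, 62.33]
without_pr = [66.72, 74.20, 78.42, 85.06, 87.01, 70.34, 61.64, 96.51, 84.23, 61.08]

def mean_accuracy(scores):
    # Unweighted mean over datasets, rounded to two decimals like the table.
    return round(sum(scores) / len(scores), 2)

avg_with_pr = mean_accuracy(ett)            # 77.61, matches the Avg column
avg_without_pr = mean_accuracy(without_pr)  # 76.52, matches the Avg column
gain = round(avg_with_pr - avg_without_pr, 2)  # 1.09 points on average
print(avg_with_pr, avg_without_pr, gain)
```
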
- eTT achieves strong Meta-Dataset performance with ViT backbones, reaching competitive average ranks (e.g., 4.1 with ViT-tiny and ViT-small configurations; 1.6 average rank with ViT-s in some setups).
- On several datasets (e.g., Texture and Fungi), eTT outperforms strong baselines such as CTX and TSA by notable margins (approximately 8% and 10%).
- eTT attains high efficiency, with learnable parameters during finetuning around 9% of the full ViT model for ViT-small configurations.
- Using self-supervised DINO pretraining without extra training data yields favorable generalization for FSL on Meta-Dataset.
- The method maintains robustness across large domain gaps by combining Attentive Prefix Tuning and Domain Residual Adapters.

This review was generated by AI and reviewed by human editors.