QUICK REVIEW

[Paper Review] CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Peng Gao, Shijie Geng|arXiv (Cornell University)|Oct 9, 2021

Multimodal Machine Learning Applications|References: 48|Cited by: 111

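One-Sentence Summary

CLIP-Adapter fine-tunes vision-language models by inserting lightweight feature adapters with residual blending, offering a simple and effective alternative to prompt tuning for few-shot tasks.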

ABSTRACT

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in (Radford et al., 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach. Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.

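Motivation and Objectives

Advance vision-language models in CLIP's open-vocabulary setting beyond prompt tuning alone.
Propose a lightweight bottleneck feature adapter that fine-tunes CLIP while keeping the backbone frozen.
Use residual-style blending to combine newly learned knowledge with the zero-shot pre-trained knowledge.
Demonstrate effectiveness across eleven datasets and multiple few-shot settings, supported by ablation studies.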

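Proposed Method

Add a small bottleneck adapter, built from two linear layers, to CLIP's image and/or text branch.
Freeze the original CLIP backbone and train only the adapters on few-shot data.
Blend the adapted features with the original features through residual connections controlled by the residual ratios α and β.
Take the existing W as the classifier weights and adjust it with a parallel adapter that also uses residual blending.
Optionally learn α and β through a hypernetwork for dataset-specific tuning.
Explore three variants: image adapter only, text adapter only, and both; the image adapter is the default.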
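
The method steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the blend is taken to be `ratio * adapted + (1 - ratio) * original` (consistent with the finding that α = 0 recovers zero-shot CLIP), a ReLU follows each linear layer, the softmax temperature of 0.01 is assumed, and all sizes, weights, and features are random stand-ins rather than real CLIP outputs. The helper names (`adapter`, `blend`, `classify`) are hypothetical.

```python
import math
import random

def linear(x, weight):
    """Multiply a row vector x (length n) by a weight matrix with n rows and m columns."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(len(weight[0]))]

def relu(x):
    return [max(0.0, v) for v in x]

def adapter(feature, w_down, w_up):
    """Bottleneck adapter: project down, ReLU, project back up, ReLU (assumed layout)."""
    hidden = relu(linear(feature, w_down))
    return relu(linear(hidden, w_up))

def blend(adapted, original, ratio):
    """Residual blending: ratio * adapted + (1 - ratio) * original."""
    return [ratio * a + (1.0 - ratio) * o for a, o in zip(adapted, original)]

def normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]

def classify(image_feature, class_weights, temperature=0.01):
    """Softmax over cosine similarities between the feature and each class weight."""
    f = normalize(image_feature)
    logits = [sum(a * b for a, b in zip(f, normalize(w))) / temperature
              for w in class_weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D, HIDDEN = 8, 2  # toy sizes; the bottleneck is D // 4
w_down = random_matrix(D, HIDDEN, rng)   # trainable in practice
w_up = random_matrix(HIDDEN, D, rng)     # trainable in practice
image_feature = [rng.gauss(0.0, 1.0) for _ in range(D)]  # stand-in for a frozen CLIP image feature
class_weights = random_matrix(3, D, rng, scale=1.0)      # stand-in for the classifier weights W

ALPHA = 0.2  # residual ratio for the image branch
blended = blend(adapter(image_feature, w_down, w_up), image_feature, ALPHA)
probs = classify(blended, class_weights)
print(probs)
```

In the text-adapter variant, the same `adapter` and `blend` calls would be applied to each row of `class_weights` with ratio β instead of to the image feature.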

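Experimental Results

Research Questions

RQ1: Can fine-tuning with lightweight feature adapters match or exceed prompt tuning for few-shot vision-language classification?
RQ2: Do the residual connection and the bottleneck design help reduce overfitting and improve generalization across diverse datasets?
RQ3: What is the best configuration (which branches to adapt, bottleneck size, residual ratio) for datasets with different characteristics?
RQ4: Can a learnable residual ratio further improve performance across datasets?
RQ5: Compared with prompt-based methods, how do adapters affect the learned feature manifold?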

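Key Findings

Across multiple few-shot settings on eleven datasets, CLIP-Adapter outperforms zero-shot CLIP, linear-probe CLIP, and CoOp.
Residual blending with a bottleneck adapter generalizes especially well in very low-shot regimes (1–2 shots).
Fine-tuning only the image branch (visual adapter) generally yields larger gains than adapting only the text branch, and combining the two is not always better.
The best bottleneck dimension is about D/4, where D is the original feature dimension; bottlenecks that are much larger or much smaller reduce performance.
The best residual ratio α depends on the dataset: fine-grained datasets favor a higher α (0.6), while generic datasets favor a lower α (about 0.2). Setting α = 0 recovers zero-shot CLIP, and α = 1 tends to overfit.
The variant that learns α and β through a hypernetwork achieves competitive results without manual tuning.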
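
To make the bottleneck finding concrete, the sketch below counts the trainable weights of a two-layer adapter for several bottleneck sizes. It assumes bias-free linear layers and a feature dimension of D = 1024; neither value is stated in this review, and both are illustrative only. The function name `adapter_weight_count` is hypothetical.

```python
def adapter_weight_count(feature_dim, reduction):
    """Weights in two bias-free linear layers: D -> D/reduction -> D."""
    hidden_dim = feature_dim // reduction
    return 2 * feature_dim * hidden_dim

FEATURE_DIM = 1024  # assumed feature dimension D, for illustration only

counts = {r: adapter_weight_count(FEATURE_DIM, r) for r in (1, 2, 4, 8, 16, 32)}
for reduction, count in counts.items():
    hidden = FEATURE_DIM // reduction
    print(f"bottleneck D/{reduction}: hidden={hidden}, weights={count}")
```

Under these assumptions a D/4 bottleneck has 524,288 weights, a quarter of the 2,097,152 needed without any reduction, which fits the observation that a moderate bottleneck helps limit overfitting on few-shot data.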

This review was generated by AI and reviewed by human editors.