QUICK REVIEW

[Paper Digest] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Renrui Zhang, Rongyao Fang | arXiv (Cornell University) | Nov 6, 2021

Multimodal Machine Learning Applications | References: 66 | Cited by: 128

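One-sentence summary

Tip-Adapter builds a training-free, non-parametric two-layer MLP adapter from a few-shot cache to enhance CLIP. Compared with training-based adapters, it is competitive in few-shot performance and converges faster when fine-tuned.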

ABSTRACT

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by using large-scale contrastive image-text pairs. It shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification. However, such a process still needs extra training and computational resources. In this paper, we propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably or even better than CLIP-Adapter. Tip-Adapter does not require any back propagation for training the adapter, but creates the weights by a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performed adapter weights without any training, which is both efficient and effective. Moreover, the performance of Tip-Adapter can be further boosted by fine-tuning such properly initialized adapter for only a few epochs with super-fast convergence speed. We conduct extensive experiments of few-shot classification on ImageNet and other 10 datasets to demonstrate the superiority of proposed Tip-Adapter. The code will be released at https://github.com/gaopengcuhk/Tip-Adapter.

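Motivation and goals

Improve CLIP's few-shot capability without full adapter fine-tuning or prompt design.
Propose a training-free, cache-based adapter that fuses few-shot knowledge with pre-trained CLIP features.
Demonstrate competitive few-shot classification performance across diverse datasets and backbones.
Show that fine-tuning from the cache initialization converges quickly and further improves performance.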

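Proposed method

Add a two-layer MLP adapter with a residual connection to CLIP.
Build a key-value cache from the K-shot training set, where the keys are CLIP visual features and the values are one-hot encoded labels.
Set the adapter weights W1 and W2 directly from the cache (W1 = F_train, W2 = L_train^T), so the adapter needs no training.
Compute test-time logits as a combination of the cache-propagated prediction and the pre-trained CLIP prediction, balanced by the residual ratio alpha.
Optionally unfreeze W1 and fine-tune for a few epochs (e.g., 20) to further improve performance while converging quickly.
Use a new activation function phi(x) = exp(-beta * (1 - x)) to modulate the affinities in cache retrieval.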
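
The training-free steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code. The function names (l2_normalize, build_cache, tip_adapter_logits), the array shapes, the default beta value, and the use of random arrays in place of real CLIP features are all assumptions made for the example. The sketch also assumes that features and text-classifier weights are L2-normalized, so that x in phi(x) is a cosine similarity, and it omits any temperature scaling of the CLIP logits.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_cache(train_feats, train_labels, num_classes):
    """Build the key-value cache from the K-shot training set.

    keys   (W1 = F_train): normalized visual features, shape (NK, C)
    values (L_train):      one-hot labels,             shape (NK, N)
    """
    keys = l2_normalize(np.asarray(train_feats, dtype=np.float64))
    values = np.eye(num_classes)[np.asarray(train_labels)]
    return keys, values


def tip_adapter_logits(test_feats, keys, values, text_weights,
                       alpha=1.0, beta=5.0):
    """Combine cache-propagated and zero-shot CLIP predictions.

    text_weights is assumed to be an (N, C) matrix of normalized
    text-encoder class embeddings. beta=5.0 is a placeholder value.
    """
    f = l2_normalize(np.asarray(test_feats, dtype=np.float64))  # (B, C)
    affinity = f @ keys.T                                       # (B, NK)
    # phi(x) = exp(-beta * (1 - x)) applied to the affinities,
    # then propagated through the one-hot values.
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ values    # (B, N)
    clip_logits = f @ text_weights.T                            # (B, N)
    return alpha * cache_logits + clip_logits


# Usage with random stand-ins for CLIP features (3 classes, 2 shots each).
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(6, 8))
train_labels = np.array([0, 0, 1, 1, 2, 2])
text_weights = l2_normalize(rng.normal(size=(3, 8)))

keys, values = build_cache(train_feats, train_labels, num_classes=3)
logits = tip_adapter_logits(rng.normal(size=(4, 8)), keys, values, text_weights)
predictions = logits.argmax(axis=1)
```

Setting alpha to 0 in this sketch recovers the plain zero-shot CLIP logits, which is one way to see the residual ratio as a dial between prior CLIP knowledge and the few-shot cache.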

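Experimental results

Research questions

RQ1: Can a training-free, cache-based adapter match or exceed the few-shot classification performance of the SGD-fine-tuned CLIP-Adapter?
RQ2: How does integrating a few-shot cache with CLIP affect zero-shot and few-shot transfer across diverse datasets and backbones?
RQ3: Does brief fine-tuning from the cache-initialized state yield faster convergence and higher accuracy?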

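Key findings

Without any training, Tip-Adapter achieves few-shot performance competitive with CLIP-Adapter.
Tip-Adapter-F (fine-tuned for a few epochs) outperforms all compared methods across multiple datasets and backbones.
Cache-based initialization enables fast convergence, requiring far fewer training epochs than CLIP-Adapter (e.g., 20 versus 200).
The gain from the cache grows with the number of shots, but saturates when the cache size is fixed (16 in the experiments).
The residual ratio alpha balances the adapted knowledge against prior CLIP knowledge; the best value in the ablation is about 1.0.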
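
To make the fine-tuned variant (Tip-Adapter-F) more concrete, the following NumPy sketch initializes the first adapter layer W1 from the cache keys and updates only W1 with a cross-entropy loss, keeping the one-hot values frozen. It is a rough illustration under stated assumptions, not the authors' training recipe: the function names (loss_and_grad, fine_tune_keys) are hypothetical, plain full-batch gradient descent stands in for whatever optimizer and schedule the paper uses, the learning rate and beta are placeholders, and random arrays replace real CLIP features and logits.

```python
import numpy as np


def loss_and_grad(keys, values, feats, onehot, clip_logits,
                  alpha=1.0, beta=5.0):
    """Mean cross-entropy loss and its gradient w.r.t. the keys (W1)."""
    affinity = feats @ keys.T                        # (B, NK)
    act = np.exp(-beta * (1.0 - affinity))           # phi(affinity)
    logits = alpha * act @ values + clip_logits      # (B, N)

    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(onehot * log_probs, axis=1))

    # Backpropagate by hand: logits -> act -> affinity -> keys.
    d_logits = (np.exp(log_probs) - onehot) / feats.shape[0]
    d_act = alpha * d_logits @ values.T              # (B, NK)
    d_affinity = d_act * beta * act                  # d phi / d x = beta * phi
    return loss, d_affinity.T @ feats                # (NK, C)


def fine_tune_keys(keys, values, feats, onehot, clip_logits,
                   epochs=20, lr=1e-3, alpha=1.0, beta=5.0):
    """Fine-tune a copy of the keys; values (W2) stay frozen."""
    keys = keys.copy()
    losses = []
    for _ in range(epochs):
        loss, grad = loss_and_grad(keys, values, feats, onehot,
                                   clip_logits, alpha, beta)
        losses.append(loss)
        keys -= lr * grad
    return keys, losses


# Usage with random stand-ins (3 classes, 2 shots each, 5-dim features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 5))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
onehot = np.eye(3)[np.array([0, 0, 1, 1, 2, 2])]
clip_logits = rng.normal(size=(6, 3))   # placeholder for CLIP's zero-shot logits

# The cache initialization: keys start as the training features themselves.
tuned_keys, losses = fine_tune_keys(feats, onehot, feats, onehot, clip_logits)
```

Because the keys start from the cached training features, each training sample already has affinity 1 with its own key before the first update, which gives some intuition for why so few epochs may be needed.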

This digest was generated by AI and reviewed by a human editor.