QUICK REVIEW

[Paper Review] Learning to Compose Soft Prompts for Compositional Zero-Shot Learning

Nihal V. Nayak, Peilin Yu | arXiv (Cornell University) | Apr 7, 2022

Domain Adaptation and Few-Shot Learning | Cited by 41

One-Sentence Summary

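The paper proposes compositional soft prompting (CSP), which learns attribute and object vocabulary tokens to improve CLIP-based compositional zero-shot learning; on standard benchmarks it improves AUC over CLIP and CoOp by an average of 10.9 and 5.8 percentage points, respectively.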

ABSTRACT

We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) like CLIP. We develop CSP for compositional zero-shot learning, the task of predicting unseen attribute-object compositions (e.g., old cat and young tiger). VLMs have a flexible text encoder that can represent arbitrary classes as natural language prompts but they often underperform task-specific architectures on the compositional zero-shot benchmark datasets. CSP treats the attributes and objects that define classes as learnable tokens of vocabulary. During training, the vocabulary is tuned to recognize classes that compose tokens in multiple ways (e.g., old cat and white cat). At test time, we recompose the learned attribute-object vocabulary in new combinations to recognize novel classes. We show that CSP outperforms the CLIP on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that fine-tunes the prefix context tokens, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to higher-order attribute-attribute-object compositions (e.g., old white cat) and combinations of pretrained attributes and fine-tuned objects. The code is available at https://github.com/BatsResearch/csp.

Motivation and Objectives

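Improve the zero-shot compositionality of vision-language models by learning attribute and object concepts as adaptable vocabulary tokens.
Recompose the learned attribute and object tokens at test time to recognize unseen class compositions.
Stay parameter-efficient by tuning only a small set of vocabulary tokens instead of fine-tuning the whole model.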

Proposed Method

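Treat attributes and objects as learnable tokens in the VLM vocabulary.
Initialize the tokens from pretrained CLIP embeddings and train them on multiple attribute-object prompts.
Build prompts of the form "A photo of [attribute] [object]", with a fixed prefix context and learnable attribute and object tokens.
Score image-text compatibility with cosine similarity in the VLM embedding space and optimize with a cross-entropy loss.
At inference, recompose the learned attribute and object vocabulary to recognize novel compositions.
Keep the parameter count small: only (|A|+|O|) × d parameters are trained, where |A| and |O| are the numbers of attributes and objects and d is the token embedding dimension.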
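
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the forward pass only, not the authors' implementation: `toy_text_encoder` is a hypothetical mean-pooling stand-in for CLIP's frozen text encoder, the 4-dimensional embeddings and the `0.07` temperature are made-up toy values, and gradient updates to the attribute and object vocabularies are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def toy_text_encoder(token_embeddings):
    """Hypothetical stand-in for the frozen text encoder: mean pooling."""
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / count for i in range(dim)]


def compose_prompt(prefix, attr_vocab, obj_vocab, attr, obj):
    """Fixed prefix tokens followed by the learnable [attribute] [object] tokens."""
    return prefix + [attr_vocab[attr], obj_vocab[obj]]


def pair_log_probs(image_feat, pairs, prefix, attr_vocab, obj_vocab, temperature=0.07):
    """Log-softmax over temperature-scaled cosine similarities, one per candidate pair."""
    logits = []
    for attr, obj in pairs:
        prompt = compose_prompt(prefix, attr_vocab, obj_vocab, attr, obj)
        logits.append(cosine(image_feat, toy_text_encoder(prompt)) / temperature)
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]


def cross_entropy(log_probs, target_index):
    """Negative log-probability of the ground-truth pair."""
    return -log_probs[target_index]


def trainable_parameter_count(num_attrs, num_objs, dim):
    """(|A| + |O|) x d: one embedding per attribute and per object."""
    return (num_attrs + num_objs) * dim


# Toy vocabularies (assumed values; in CSP these would start from CLIP embeddings).
prefix = [[0.1, 0.1, 0.1, 0.1]]  # fixed context, never updated
attr_vocab = {"old": [1.0, 0.0, 0.0, 0.0], "young": [0.0, 1.0, 0.0, 0.0]}
obj_vocab = {"cat": [0.0, 0.0, 1.0, 0.0], "tiger": [0.0, 0.0, 0.0, 1.0]}

# Training sees only some pairs; "young tiger" is held out.
seen_pairs = [("old", "cat"), ("young", "cat"), ("old", "tiger")]
all_pairs = seen_pairs + [("young", "tiger")]

# Training-time loss for an image labeled "old cat" (index 0 in seen_pairs).
train_image = [1.0, 0.0, 1.0, 0.0]
train_loss = cross_entropy(
    pair_log_probs(train_image, seen_pairs, prefix, attr_vocab, obj_vocab), 0
)

# Test time: the same vocabularies are recomposed into the unseen pair.
test_image = [0.0, 1.0, 0.0, 1.0]
test_log_probs = pair_log_probs(test_image, all_pairs, prefix, attr_vocab, obj_vocab)
best_index = max(range(len(all_pairs)), key=test_log_probs.__getitem__)
predicted_pair = all_pairs[best_index]

toy_param_count = trainable_parameter_count(len(attr_vocab), len(obj_vocab), 4)
```

In this toy setup the held-out pair ("young", "tiger") is scored without any prompt specific to it having been trained, which is the recomposition idea in miniature. For scale, `trainable_parameter_count(100, 200, 768)` gives 230,400 parameters; those sizes are illustrative and not taken from the paper.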

Experimental Results

Research Questions

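RQ1: Does learning composable attribute and object tokens improve the zero-shot compositionality of CLIP-based models?
RQ2: How well does CSP generalize to higher-order compositions and to vocabularies that mix pretrained and fine-tuned tokens?
RQ3: How much does CSP improve over CLIP and soft-prompting baselines on standard datasets?
RQ4: Does training on attribute-object compositions generalize to attribute-attribute-object compositions and to unseen attributes?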

Key Findings

Dataset	Seen (S)	Unseen (U)	Harmonic mean (H)	AUC
MIT-States	46.6	49.9	36.3	19.4
UT-Zappos	64.2	66.2	46.6	33.0
C-GQA	28.8	26.8	20.5	6.2
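
The S, U, H, and AUC columns can be read with the following sketch in mind. It assumes the common generalized compositional zero-shot protocol, in which a calibration bias is swept to trade seen accuracy against unseen accuracy and the four numbers summarize the resulting curve; the function names and the toy curve are hypothetical and are not taken from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy at one operating point."""
    total = seen_acc + unseen_acc
    return 0.0 if total == 0 else 2 * seen_acc * unseen_acc / total


def summarize_curve(points):
    """Summarize (seen_acc, unseen_acc) pairs, one pair per calibration bias."""
    ordered = sorted(points)  # ascending seen accuracy
    # Trapezoidal area under the unseen-vs-seen accuracy curve.
    auc = sum(
        (s1 - s0) * (u0 + u1) / 2
        for (s0, u0), (s1, u1) in zip(ordered, ordered[1:])
    )
    return {
        "S": max(s for s, _ in points),                 # best seen accuracy
        "U": max(u for _, u in points),                 # best unseen accuracy
        "H": max(harmonic_mean(s, u) for s, u in points),  # best harmonic mean
        "AUC": auc,
    }


# Toy curve (made-up numbers): each pair is one setting of the calibration bias.
toy_points = [(0.0, 0.5), (0.2, 0.4), (0.4, 0.2), (0.5, 0.0)]
summary = summarize_curve(toy_points)
```

Under this reading, S and U come from opposite ends of the curve while H is the best value at a single operating point, which would explain why H in each row is lower than both S and U.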

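CSP improves AUC over CLIP by an average of 10.9 percentage points across the benchmark datasets.
On the same metric, CSP improves over CoOp by an average of 5.8 percentage points.
In the open-world setting, CSP achieves notable gains on MIT-States, UT-Zappos, and C-GQA, and often outperforms task-specific architectures.
CSP generalizes to higher-order attribute-attribute-object compositions, and its accuracy on unseen attributes improves over CLIP.
Training on attribute-object compositions improves CLIP's performance on attribute classification, attribute-attribute-object compositions, and mixed-vocabulary scenarios.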


This review was generated by AI and reviewed by human editors.