QUICK REVIEW

[Paper Review] VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang | arXiv (Cornell University) | Dec 4, 2021

Multimodal Machine Learning Applications | Cited by 28

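ONE-SENTENCE SUMMARY

VT-CLIP uses a visual-guided cross-attention module to adapt CLIP's text features to the spatial visual features of each image, improving few-shot recognition across 11 datasets.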

ABSTRACT

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transferring performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.

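MOTIVATION AND GOALS

Improve CLIP's cross-modal alignment under few-shot conditions.
Let text prompts use visual context to focus adaptively on informative image regions.
Preserve the original text features through a residual connection to maintain robustness.
Demonstrate gains over the baselines (Zero-shot CLIP, CoOp, CLIP-Adapter) across datasets.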

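PROPOSED METHOD

Introduce a visual-guided cross-attention module in which text features query the spatial visual features, producing image-adapted text representations.
Use the pre-trained CLIP components with frozen encoders; train only the cross-attention module.
Use contextual-level spatial image features (taken before pooling) as the keys and values for cross-attention.
Apply a residual connection to fuse the adapted text features with the original text features.
Compute similarity against the adapted text features to obtain the final classification scores.
Evaluate on 11 datasets in few-shot settings (1, 2, 4, 8, and 16 shots).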
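
The steps above (text queries, spatial keys and values, residual fusion, similarity scoring) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a single attention head, random arrays in place of CLIP features, a mean over spatial positions as a stand-in for CLIP's pooled image feature, and a temperature of 100. The function names, projection matrices, and tensor sizes are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_guided_text(text_feats, spatial_feats, w_q, w_k, w_v, w_o):
    """Single-head cross-attention with a residual connection (a sketch).

    text_feats:    (C, D) one text feature per category (queries)
    spatial_feats: (N, D) pre-pooling image features (keys and values)
    Returns the adapted text features (C, D) and the attention map (C, N).
    """
    q = text_feats @ w_q
    k = spatial_feats @ w_k
    v = spatial_feats @ w_v
    # Each category attends over the N spatial positions of the image.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    update = (attn @ v) @ w_o
    # The residual connection keeps the original text features in the output.
    return text_feats + update, attn


def classify(global_feat, adapted_text, temperature=100.0):
    # Cosine similarity between the image feature and each adapted text feature.
    logits = temperature * (l2_normalize(adapted_text) @ l2_normalize(global_feat))
    return softmax(logits)


# Toy sizes (assumed): 5 categories, 7x7 = 49 spatial positions, 64 dimensions.
rng = np.random.default_rng(0)
C, N, D = 5, 49, 64
text_feats = rng.standard_normal((C, D))
spatial_feats = rng.standard_normal((N, D))
global_feat = spatial_feats.mean(axis=0)  # stand-in for the pooled image feature
w_q, w_k, w_v, w_o = (0.05 * rng.standard_normal((D, D)) for _ in range(4))

adapted, attn = visual_guided_text(text_feats, spatial_feats, w_q, w_k, w_v, w_o)
probs = classify(global_feat, adapted)
print(adapted.shape, attn.shape, probs.shape)
```

In this sketch only `w_q`, `w_k`, `w_v`, and `w_o` would be trainable, mirroring the frozen-encoder setup described above. Note that the text features are adapted separately for each image, so the attention is recomputed per input.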

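EXPERIMENTAL RESULTS

Research Questions

RQ1: In few-shot settings, does visual-guided text adaptation improve cross-modal alignment on downstream tasks?
RQ2: How does cross-attention between spatial image features and text features affect VT-CLIP's category-wise matching?
RQ3: How do architectural choices in the visual-guided cross-attention module (number of heads, number of layers) affect performance?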

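Key Findings

VT-CLIP consistently outperforms Zero-shot CLIP, CoOp, and CLIP-Adapter across the 11 datasets in few-shot settings.
VT-CLIP's accuracy gains grow as the number of training samples increases.
In the low-shot regime, VT-CLIP is more stable than CoOp.
Ablation studies show that the cross-attention module performs best with two heads, and that stacking more cascaded layers can reduce performance in few-shot scenarios.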

This review was generated by AI and reviewed by a human editor.