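[Paper Digest] When and why vision-language models behave like bags-of-words, and what to do about it?
This paper introduces the Attribution, Relation, and Order (ARO) benchmark to diagnose compositional understanding in vision-language models, shows that current models handle relations, attributes, and word order poorly, and then demonstrates that composition-aware hard negatives substantially improve performance.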
Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
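Motivation and Goals
- Assess how well vision-language models encode object attributes, relations, and word order across captions and images.
- Build a large-scale benchmark that measures understanding of attributes, relations, and order.
- Analyze why retrieval-centric training may neglect compositional ability.
Proposed Method
- Build the Visual Genome Attribution and Visual Genome Relation tasks, which test object attributes and relations by pairing each ground-truth caption with a swapped counterpart.
- Create the COCO Order and Flickr30k Order tasks, which use systematic perturbations to test a model's sensitivity to caption word order.
- Evaluate four state-of-the-art VLMs (CLIP, BLIP, FLAVA, X-VLM) on the ARO benchmark.
- Critique retrieval-based evaluation and contrastive learning as insufficient for measuring and inducing compositional understanding.
- Propose composition-aware hard negative mining: during fine-tuning, generate negative captions and sample nearest-neighbor images.
- Show that NegCLIP improves order and relation understanding with almost no loss on downstream tasks.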
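The swapped-caption and word-order tests listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the caption templates, the helper names (`swap_relation`, `swap_attributes`, `shuffle_words`, `pairwise_accuracy`), the half-credit rule for ties, and the toy `bag_of_words_score` scorer (which treats an "image" as a set of tags) are all assumptions made for the example. It shows why a scorer that ignores word order cannot do better than chance on such pairs.

```python
import random


def swap_relation(subject, relation, obj):
    """Return (true_caption, swapped_caption) for a relation triple."""
    true_caption = f"the {subject} is {relation} the {obj}"
    swapped_caption = f"the {obj} is {relation} the {subject}"
    return true_caption, swapped_caption


def swap_attributes(attr1, obj1, attr2, obj2):
    """Return (true_caption, swapped_caption) with the two attributes exchanged."""
    true_caption = f"the {attr1} {obj1} and the {attr2} {obj2}"
    swapped_caption = f"the {attr2} {obj1} and the {attr1} {obj2}"
    return true_caption, swapped_caption


def shuffle_words(caption, seed=0):
    """Return the caption with its words in a different order (one possible perturbation)."""
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # nothing to reorder
    rng = random.Random(seed)
    shuffled = list(words)
    while shuffled == words:  # re-draw until the order actually changes
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def pairwise_accuracy(score, cases):
    """Fraction of cases where score(image, true) beats score(image, perturbed).

    Ties receive half credit (an assumption), so an order-blind scorer lands at 0.5.
    """
    total = 0.0
    for image, true_caption, perturbed_caption in cases:
        s_true = score(image, true_caption)
        s_pert = score(image, perturbed_caption)
        if s_true > s_pert:
            total += 1.0
        elif s_true == s_pert:
            total += 0.5
    return total / len(cases)


def bag_of_words_score(image_tags, caption):
    """Toy scorer: counts caption words that appear among the image tags, ignoring order."""
    return len(set(caption.split()) & set(image_tags))
```

A real VLM can be plugged in by replacing `bag_of_words_score` with a function that returns the model's image-caption similarity; the toy scorer ties on every pair because the true and perturbed captions contain the same words.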
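Experimental Results
Research Questions
- RQ1: Do VLMs reliably understand combinations of relations and attributes in images?
- RQ2: Are VLMs sensitive to word order in captions that describe visual scenes?
- RQ3: Does contrastive pretraining on retrieval datasets lead models to ignore compositional and order cues?
- RQ4: Can composition-aware hard negative mining improve compositional understanding without hurting other tasks?
Key Findings
- VLMs largely fail at relational understanding and attribute swaps, and are almost insensitive to caption word order.
- Models perform close to chance on the Visual Genome Relation and Attribution tasks; for example, accuracy on spatial relations fluctuates widely but is generally low, verbs are very difficult, and attribute performance varies by model.
- The order-sensitivity tests (COCO Order, Flickr30k Order) show that the tested models have only a small preference, and in some cases almost none, for the correctly ordered caption.
- Retrieval performance stays high even when order and compositional cues are perturbed, indicating that retrieval-based evaluation hides compositional deficiencies.
- Introducing composition-aware hard negatives (NegCLIP) yields substantial gains: VG-Relation from 63% to 81%, VG-Attribution from 62% to 71%, COCO Order from 46% to 86%, and Flickr30k Order from 59% to 91%, with almost no drop on downstream tasks.
- NegCLIP becomes competitive with, or even better than, other methods on several compositionality benchmarks while maintaining performance on CIFAR-10/100, ImageNet, Flickr30k, and COCO.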
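One way to read "composition-aware hard negatives" is as extra caption candidates in the contrastive objective. The following is a minimal sketch under stated assumptions: it uses a symmetric CLIP-style cross-entropy in which each image must pick its true caption among the N true captions plus N perturbed ones, while each true caption picks its image among the N images. The function names, the omission of a temperature, and the exact loss form are illustrative assumptions and are not taken from this summary; nearest-neighbor image sampling is left out.

```python
import math


def log_softmax_at(logits, target):
    """Log-probability of index `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def contrastive_loss_with_hard_negatives(sim_true, sim_neg):
    """Contrastive loss with perturbed captions added as extra negatives.

    sim_true[i][j]: similarity between image i and the true caption of image j.
    sim_neg[i][j]:  similarity between image i and the perturbed caption of image j.
    Both are N x N lists of floats (hypothetical inputs, already temperature-scaled).
    """
    n = len(sim_true)
    # Image-to-text: image i must select caption i among 2N candidates.
    image_to_text = -sum(
        log_softmax_at(sim_true[i] + sim_neg[i], i) for i in range(n)
    ) / n
    # Text-to-image: true caption j must select image j among N images.
    text_to_image = -sum(
        log_softmax_at([sim_true[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

Under this formulation, a model that scores a perturbed caption as highly as the true one pays a penalty in the image-to-text term, which plain in-batch negatives would not impose.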
This digest was generated by AI and reviewed by a human editor.