QUICK REVIEW

[Paper Review] SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

Cheng-Yu Hsieh, Jieyu Zhang|arXiv (Cornell University)|Jun 26, 2023

Text Readability and Simplification | Cited by 10

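One-Sentence Summary

SugarCrepe is a vision-language compositionality benchmark that reduces bias by combining hard negatives generated with a large language model and adversarial refinement, and it reveals that the improvements reported by earlier methods on older benchmarks were overestimated.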

ABSTRACT

In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models have permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in all these benchmarks rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models. To remedy this rampant vulnerability, we introduce SugarCrepe, a new benchmark for vision-language compositionality evaluation. We employ large language models, instead of rule-based templates used in previous benchmarks, to generate fluent and sensical hard negatives, and utilize an adversarial refinement mechanism to maximally reduce biases. We re-evaluate state-of-the-art models and recently proposed compositionality inducing strategies, and find that their improvements were hugely overestimated, suggesting that more innovation is needed in this important direction. We release SugarCrepe and the code for evaluation at: https://github.com/RAIVNLab/sugar-crepe.

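Motivation and Objectives

Identify the biases in existing vision-language compositionality benchmarks that let models with no access to the image score well by exploiting loopholes.
Develop a new benchmark-generation workflow that produces fluent, plausible hard negatives.
Reduce the distribution gaps and artifacts that lead to artifact-driven performance gains.
Fairly re-evaluate recently proposed compositionality methods and pretrained CLIP models.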

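Proposed Method

Use ChatGPT to generate fluent, plausible hard negatives from the positive captions.
Manually verify the hard negatives to filter out false negatives.
Apply an adversarial refinement procedure that balances the score-gap distribution and removes exploitable biases.
Cover seven fine-grained hard-negative types to test compositional understanding (Replace, Swap, and Add operations applied to objects, attributes, and relations).
Evaluate existing compositionality methods and large-scale pretrained CLIP models on SugarCrepe, and compare the results with earlier benchmarks.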
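
The score-gap balancing step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single hypothetical text-only score per caption (for example a plausibility score), and the function name `balance_score_gaps` and the `bin_width` parameter are invented here. The idea is to subsample candidates so that, within each gap-magnitude bin, as many examples favor the negative caption as favor the positive one, which leaves a text-only scorer with no systematic signal.

```python
import random
from collections import defaultdict

def balance_score_gaps(gaps, bin_width=0.1, seed=0):
    """Return sorted indices of a subset whose gap histogram is symmetric about zero.

    gaps[i] is a hypothetical text-only score gap for candidate i:
    score(positive caption) - score(hard negative), computed without the image.
    """
    rng = random.Random(seed)
    pos_bins, neg_bins = defaultdict(list), defaultdict(list)
    keep = []
    for i, g in enumerate(gaps):
        if g == 0:
            keep.append(i)  # a zero gap favors neither caption, so it is unbiased
            continue
        k = int(abs(g) // bin_width)  # bin by gap magnitude only
        (pos_bins if g > 0 else neg_bins)[k].append(i)
    for k in set(pos_bins) | set(neg_bins):
        p, n = pos_bins.get(k, []), neg_bins.get(k, [])
        m = min(len(p), len(n))  # keep equal counts on both sides of zero
        keep += rng.sample(p, m) + rng.sample(n, m)
    return sorted(keep)

# Toy gaps: most favor the positive caption, as in a biased benchmark.
gaps = [0.35, 0.32, 0.31, -0.33, 0.05, -0.02, 0.72]
kept = balance_score_gaps(gaps)
```

In this toy input, the example with gap 0.72 has no mirrored counterpart and is dropped, and only one of the three examples around 0.3 survives to match the single example at -0.33.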

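Experimental Results

Research Questions

RQ1: Do existing vision-language compositionality benchmarks contain biases that let models with no access to the image score well?
RQ2: Does a benchmark generated with large language models and adversarial refinement give a more faithful measure of compositional understanding?
RQ3: How do recent compositionality methods and large pretrained CLIP models perform on SugarCrepe compared with traditional benchmarks?

Key Findings

Existing benchmarks are highly hackable: text-only models can outperform vision-language models by exploiting hard negatives that are nonsensical and non-fluent.
SugarCrepe reduces these biases through LLM-generated hard negatives and adversarial refinement, which makes the score-gap distribution symmetric.
NegCLIP-style hard-negative augmentation shows large gains on older benchmarks but much smaller gains on SugarCrepe, indicating overfitting to artifacts.
On SugarCrepe, the best pretrained CLIP models still lag behind human performance, especially on Swap negatives and on attribute- and relation-related negatives.
On SugarCrepe, model performance correlates with ImageNet zero-shot accuracy, and the strength of the correlation varies across hard-negative categories.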
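
To make the hackability finding concrete, the sketch below shows how a benchmark of this kind can be scored and how a blind scorer can beat the 50% chance level. Everything in it is illustrative: the example captions, the `blind_score` function, and its `IMPLAUSIBLE` lookup are toy assumptions, not data or code from SugarCrepe. An example counts as correct when the positive caption receives a strictly higher score than the hard negative.

```python
def pairwise_accuracy(examples, score):
    """Fraction of examples where the positive caption outscores the hard negative."""
    correct = sum(
        score(ex["image"], ex["positive"]) > score(ex["image"], ex["negative"])
        for ex in examples
    )
    return correct / len(examples)

# Toy examples (hypothetical); the first two negatives are implausible on their own.
examples = [
    {"image": "img0", "positive": "a dog on a sofa", "negative": "a sofa on a dog"},
    {"image": "img1", "positive": "a man riding a horse", "negative": "a horse riding a man"},
    {"image": "img2", "positive": "a red cup next to a blue plate",
     "negative": "a blue cup next to a red plate"},
]

# Stand-in for a text-only plausibility model: it never looks at the image.
IMPLAUSIBLE = {"a sofa on a dog", "a horse riding a man"}

def blind_score(image, caption):
    return 0.0 if caption in IMPLAUSIBLE else 1.0

blind_accuracy = pairwise_accuracy(examples, blind_score)
```

Here the blind scorer gets the two examples with implausible negatives right and fails only on the third, where both captions are equally plausible and the image is needed to decide. A real evaluation would replace `blind_score` with an image-text similarity from a model such as CLIP.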


This review was generated by AI and checked by a human editor.