QUICK REVIEW

[Paper Review] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Amita Kamath, Jack Hessel|arXiv (Cornell University)|Feb 26, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

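The paper argues that reporting bias in vision-language data suppresses four core reasoning skills (spatial, temporal, negation, and counting). Scaling model size, data size, or the number of languages does not fix this on its own, whereas targeted annotator instructions and intentional data collection can improve VLM reasoning.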

ABSTRACT

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.

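Motivation and Objectives

Investigate whether reporting bias in vision-language data suppresses key reasoning skills.
Assess whether scaling data size, model size, or multilingual data alleviates the under-representation of reasoning in VLMs.
Assess whether annotator instructions mitigate reporting bias, and whether fine-tuning on the resulting data improves reasoning.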

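Proposed Method

The authors analyze three open image-text corpora (LAION, LLaVA-1.5, PixMo) that underlie popular VLMs, and quantify the under-representation of the four reasoning types using keyword occurrence frequencies combined with manually verified estimates.
They curate four reasoning benchmarks (spatial, counting, negation, temporal) and evaluate a range of contrastive models (OpenCLIP variants) and generative models (LLaVA-1.5, Molmo, and others) on them.
They run scaling experiments by varying data size (LAION-80M/400M/2B) and model size, and assess multilingual diversity by translating captions into English.
They conduct an annotator-instruction study and a controlled captioning experiment to measure how guiding instructions change the frequency of reasoning concepts in captions.
They experiment with fine-tuning on a counting-focused dataset to assess whether additional reasoning data translates into performance gains.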
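The keyword-frequency step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the keyword lists, the sample captions, and the helper names (`compile_patterns`, `keyword_rate`, `adjusted_rate`) are all hypothetical. The adjustment step assumes that manual verification yields a precision estimate, i.e. the fraction of keyword hits that actually express the reasoning skill, which is one plausible way to combine the two signals.

```python
import re

# Hypothetical keyword lists for illustration only; the paper's actual
# lexicons are not reproduced here.
KEYWORDS = {
    "spatial": ["left of", "right of", "above", "below", "behind"],
    "temporal": ["before", "after", "then"],
    "negation": ["no", "not", "without"],
    "counting": ["two", "three", "four", "five"],
}


def compile_patterns(keywords):
    """Build one case-insensitive whole-word regex per reasoning skill."""
    return {
        skill: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for skill, terms in keywords.items()
    }


def keyword_rate(captions, pattern):
    """Fraction of captions that contain at least one keyword match."""
    if not captions:
        return 0.0
    hits = sum(1 for caption in captions if pattern.search(caption))
    return hits / len(captions)


def adjusted_rate(raw_rate, verified_precision):
    """Scale the raw keyword rate by a manually estimated precision.

    verified_precision is assumed to be the share of sampled keyword hits
    that a human judged to genuinely express the reasoning skill.
    """
    return raw_rate * verified_precision


# Made-up captions; "no filter needed" matches a negation keyword without
# describing anything absent from the image.
captions = [
    "at the game today!",
    "a dog to the left of a red couch",
    "two cats sleeping on a bed",
    "sunset over the bay",
    "no filter needed",
]

patterns = compile_patterns(KEYWORDS)
rates = {skill: keyword_rate(captions, p) for skill, p in patterns.items()}
```

A raw keyword hit is only an upper bound on how often a skill is actually exercised, which is why the sketch keeps the precision adjustment as a separate step: a caption can contain "no" or "two" without supervising negation or counting.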

Figure 1: Examples from LAION-2B of data points that contain reasoning-related keywords that do and do not operationalize the reasoning capability itself.

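Experimental Results

Research Questions

RQ1: Does reporting bias in web-scale vision-language data lead to under-representation of spatial, temporal, counting, and negation reasoning?
RQ2: Does increasing data size, model size, or multilingual data alone produce emergent reasoning capabilities in VLMs?
RQ3: Can annotator instructions mitigate reporting bias and improve VLM reasoning, and are they sufficient without large-scale retraining?
RQ4: How does targeted data collection affect performance on the proposed reasoning benchmarks?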

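Key Findings

Reasoning-related concepts are extremely rare in open image-text corpora (e.g., spatial reasoning accounts for roughly 0.1% of LAION); these skills remain sparsely represented even at web scale.
Scaling data and model size does not reliably produce emergent spatial, temporal, negation, or counting reasoning, and multilingual scaling alone may also be ineffective; some models remain far below human performance.
Annotator instructions significantly increase the presence of the targeted reasoning signals in captions (spatial, counting, negation, temporal), and fine-tuning on reasoning-rich data yields improvements, indicating that data quality is critical.
Open-source generative models outperform contrastive models on average, but still fall well short of human performance on negation and temporal reasoning.
Together, these results argue for improving VLM reasoning through intentional data collection and annotation strategies rather than through scale alone.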

Figure 2: Examples from our four benchmarks for contrastive and generative evaluations. The generative evaluation is in MCQ format but for counting, for which a free form output with a given range yielded higher scores.
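The contrastive evaluation mentioned in the caption can be illustrated with a small scoring sketch. It assumes a common setup for contrastive models: each benchmark item pairs one image with several candidate captions, the model picks the caption whose embedding is most similar to the image embedding, and accuracy is the share of items where that pick is the correct caption. The vectors below are toy values, not real model outputs, and the names `cosine` and `contrastive_accuracy` are hypothetical; the paper's exact protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_accuracy(examples):
    """Share of items where the best-scoring caption is the correct one.

    Each example is a dict with an "image" embedding, a list of "captions"
    embeddings, and the index of the correct caption under "answer".
    """
    correct = 0
    for example in examples:
        scores = [cosine(example["image"], c) for c in example["captions"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == example["answer"]:
            correct += 1
    return correct / len(examples)


# Toy embeddings standing in for encoder outputs (assumed, for illustration).
examples = [
    # The correct caption (index 0) is closest to the image.
    {"image": [1.0, 0.0], "captions": [[0.9, 0.1], [0.1, 0.9]], "answer": 0},
    # The distractor (index 1) scores higher than the correct caption.
    {"image": [0.0, 1.0], "captions": [[0.8, 0.2], [0.7, 0.3]], "answer": 0},
]

accuracy = contrastive_accuracy(examples)
```

With two candidates per item, chance accuracy under this scheme is 50%, which is the baseline to keep in mind when reading contrastive scores.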


This review was generated by AI and reviewed by a human editor.