QUICK REVIEW

[Paper Review] VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

Neil Tripathi|arXiv (Cornell University)|Mar 3, 2026

Multimodal Machine Learning Applications | Cited by 0

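One-Sentence Summary

VB is a benchmark that tests whether vision-language models, given a single image and a short question and with the option to abstain, can judge whether a visibility claim is supported; it combines minimal-edit perturbations with second-order perspective reasoning, scores confidence-aware accuracy with abstention, and compares robustness and calibration across models.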

ABSTRACT

We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.

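Motivation and Goals

Assess whether vision-language models can verify a visibility claim from a single image and a short question.
Assess robustness to controlled minimal image edits that should change the correct label.
Test calibrated abstention when a human viewer cannot reliably answer from the photo.
Probe second-order perspective reasoning through the MULTI_AGENT/SECOND_ORDER slice.
Provide a public dataset and evaluation infrastructure for visibility-grounded VQA.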

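Proposed Method

Propose a 2x2 family design that crosses a minimal image edit with a minimal text edit, giving four evaluation cells per family.
Use three headline cells (BASE, TEXT_FLIP, IMAGE_FLIP) plus one diagnostic cell (DOUBLE_FLIP); headline numbers are computed on the strict XOR subset of three cells per family.
Define the labels VISIBLY_TRUE, VISIBLY_FALSE, and ABSTAIN, and have the model report a confidence score for each item.
Introduce an eight-category visibility taxonomy with reason codes tied to specific visibility factors (e.g., OCCLUSION, OUT_OF_FRAME, GAZE_DIRECTION).
Compute confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and ToMAcc for second-order reasoning.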
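
The family design described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released code: the `Cell`, `Item`, `headline_items`, `flip_consistency`, and `toy_items` names are hypothetical, the (image edited, text edited) encoding of each cell is an assumption inferred from the cell names, and `flip_consistency` is a simplified stand-in that should not be read as the paper's MEFR definition.

```python
from dataclasses import dataclass
from enum import Enum


class Cell(Enum):
    # Assumed encoding: (image_edited, text_edited) for each cell of the 2x2 design.
    BASE = (False, False)
    TEXT_FLIP = (False, True)
    IMAGE_FLIP = (True, False)
    DOUBLE_FLIP = (True, True)  # diagnostic cell, excluded from headline scoring


HEADLINE_CELLS = (Cell.BASE, Cell.TEXT_FLIP, Cell.IMAGE_FLIP)


@dataclass(frozen=True)
class Item:
    family_id: int
    cell: Cell
    gold: str          # VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN
    pred: str          # model output, same label set
    confidence: float  # model-reported confidence (assumed to lie in [0, 1])


def headline_items(items):
    """Keep the three headline cells per family; drop DOUBLE_FLIP."""
    return [it for it in items if it.cell in HEADLINE_CELLS]


def flip_consistency(items, flipped_cell):
    """Hypothetical proxy (not the paper's MEFR): share of families in which
    the model is correct on both BASE and the given edited cell."""
    by_family = {}
    for it in items:
        by_family.setdefault(it.family_id, {})[it.cell] = it
    pairs = [
        (cells[Cell.BASE], cells[flipped_cell])
        for cells in by_family.values()
        if Cell.BASE in cells and flipped_cell in cells
    ]
    if not pairs:
        return 0.0
    hits = sum(b.pred == b.gold and f.pred == f.gold for b, f in pairs)
    return hits / len(pairs)


def toy_items():
    """Two made-up families, purely for illustration."""
    T, F, A = "VISIBLY_TRUE", "VISIBLY_FALSE", "ABSTAIN"
    return [
        Item(0, Cell.BASE, T, T, 0.9),
        Item(0, Cell.TEXT_FLIP, F, F, 0.8),
        Item(0, Cell.IMAGE_FLIP, F, T, 0.7),  # model misses the image edit
        Item(0, Cell.DOUBLE_FLIP, T, T, 0.6),
        Item(1, Cell.BASE, T, T, 0.9),
        Item(1, Cell.TEXT_FLIP, F, F, 0.8),
        Item(1, Cell.IMAGE_FLIP, A, A, 0.5),
        Item(1, Cell.DOUBLE_FLIP, F, F, 0.6),
    ]
```

With 100 families, `headline_items` would return 300 items, matching the number of scored items per model reported in the abstract. Comparing `flip_consistency(items, Cell.TEXT_FLIP)` with `flip_consistency(items, Cell.IMAGE_FLIP)` gives a rough sense of the text-versus-image robustness comparison, under the simplifying assumptions stated above.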

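Experimental Results

Research Questions

RQ1: Can vision-language models correctly judge whether the pixels of an image support a visibility claim?
RQ2: Do minimal image or text edits flip the correct label as intended, and do model judgments track the change in evidence?
RQ3: Can models abstain when a confident answer cannot be obtained from the image?
RQ4: How well do models perform second-order perspective reasoning from a single image?
RQ5: How do open-source models compare with flagship closed-source models on visibility reasoning?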

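Key Findings

GPT-4o and Gemini 3.1 Pro effectively tie for the highest composite FinalScore (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678).
The best open-source model, Gemma 3 12B, reaches 0.505, showing that an open-source model in the 8B to 12B range can surpass one prior-generation closed-source system.
Text-flip robustness exceeds image-flip robustness for six of the nine models.
On ToMAcc (second-order reasoning), there is a substantial gap between flagship closed-source models and open-source models.
Calibration and abstention behavior vary widely across models: some are highly confident on the answers they get right, while others show anti-informative confidence rankings.
MEFR results show that, for most models, text edits are handled more reliably than image edits.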
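
The calibration finding above concerns how well a model's confidence ranks its own answers. A common way to probe this is accuracy at a given coverage: keep only the most confident fraction of answers and measure accuracy on what remains. The sketch below illustrates that general technique; it is not the paper's SelRank definition, and the function name `accuracy_at_coverage` and the `(confidence, is_correct)` record layout are assumptions made for this example.

```python
import math


def accuracy_at_coverage(records, coverage):
    """Accuracy on the top `coverage` fraction of answers, ranked by confidence.

    records: list of (confidence, is_correct) pairs (assumed layout).
    coverage: fraction in (0, 1] of answers to keep.
    """
    if not records:
        raise ValueError("records must be non-empty")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Most confident answers first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    k = max(1, math.ceil(coverage * len(ranked)))
    kept = ranked[:k]
    return sum(1 for _, ok in kept if ok) / k


# Made-up records: an informative ranking and an anti-informative one.
informative = [(0.9, True), (0.8, True), (0.6, False), (0.3, False)]
anti_informative = [(0.9, False), (0.8, False), (0.6, True), (0.3, True)]
```

Both toy lists have the same overall accuracy, yet keeping only the more confident half helps the first and hurts the second. This mirrors the pattern in the findings, where models with similar accuracy can differ sharply in selective prediction quality.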


This review was generated by AI and reviewed by a human editor.