QUICK REVIEW

[Paper Review] Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Manar Salamah Ali, Judith Sieker|arXiv (Cornell University)|Jan 12, 2026

Neurobiology of Language and Bilingualism | Cited by 0

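One-Sentence Summary

This paper uses a color-grid reference task to evaluate whether vision-language models can express internal uncertainty through explicit clarification requests, and finds limited alignment between uncertainty and clarification behavior across several models.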

ABSTRACT

In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

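Research Motivation and Goals

Use reference games as a controlled task in which ambiguity and clarification needs are explicit and measurable.
Evaluate whether vision-language models can recognize their own uncertainty and generate clarification questions in a reference game setting.
Compare baseline reference resolution performance with a dedicated clarification experiment across multiple models.
Assess whether human-in-the-loop feedback improves model performance in clarification scenarios.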

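Proposed Method

Uses a color-grid reference game dataset (197 games × 60 rounds) in which each item contains three grids (one target plus two distractors).
Tests three models (Qwen2.5-VL-7B, Qwen2.5-VL-72B, GPT-5-mini) on the baseline task and the clarification task.
Baseline: the speaker's utterance and the three grids are provided; the model acts as the addressee and predicts the target.
Clarification experiment: the model is prompted to ask a clarification question when uncertain; responses are sampled, and CR-Rate, accuracy, and relaxed accuracy are computed.
Diversity sampling and MSP-based uncertainty estimation (where applicable) are used to quantify model confidence.
Optional human-in-the-loop interaction is used to assess whether clarification improves end-to-end performance.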
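The metrics named above can be illustrated with a minimal Python sketch. It assumes that MSP stands for maximum softmax probability computed over per-grid scores, that CR-Rate is the share of responses that are clarification requests, and that accuracy is computed over the responses that actually pick a grid. The `Response` class, its fields, and the function names are hypothetical and are not taken from the paper. Relaxed accuracy is left out because this summary does not define it.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def msp(logits: Sequence[float]) -> float:
    """Maximum softmax probability over the candidate grids (assumed confidence proxy)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


@dataclass
class Response:
    """Hypothetical record of one sampled model response."""
    target: int               # index of the gold grid (0, 1, or 2)
    predicted: Optional[int]  # chosen grid, or None if the model asked a question
    is_clarification: bool    # True if the response is a clarification request


def cr_rate(responses: List[Response]) -> float:
    """Share of responses that are clarification requests."""
    if not responses:
        raise ValueError("no responses")
    return sum(r.is_clarification for r in responses) / len(responses)


def accuracy(responses: List[Response]) -> float:
    """Accuracy over responses that commit to a grid (an assumption of this sketch)."""
    answered = [r for r in responses if not r.is_clarification]
    if not answered:
        return 0.0
    return sum(r.predicted == r.target for r in answered) / len(answered)


if __name__ == "__main__":
    sample = [
        Response(target=0, predicted=0, is_clarification=False),
        Response(target=1, predicted=2, is_clarification=False),
        Response(target=2, predicted=None, is_clarification=True),
        Response(target=1, predicted=1, is_clarification=False),
    ]
    print("CR-Rate:", cr_rate(sample))
    print("Accuracy:", accuracy(sample))
    print("MSP:", msp([2.0, 0.5, 0.1]))
```

With three candidate grids, MSP ranges from 1/3 (uniform scores) to 1 (all mass on one grid), so a model whose MSP stays near 1 while it still answers incorrectly is overconfident in the sense discussed in this review.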

Figure 1: Example item from the dataset. The speaker referred to the second item with the description “Bottom left is bright pink.”

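Experimental Results

Research Questions

RQ1: Can vision-language models recognize their own uncertainty in a controlled reference game and translate it into explicit clarification requests?
RQ2: How do baseline reference resolution performance and model confidence relate to the tendency to ask clarification questions across difficulty levels (far, split, close)?
RQ3: Do clarification requests improve end-to-end accuracy, and can human-in-the-loop feedback enhance performance?
RQ4: Are there systematic differences in clarification behavior between model families (Qwen-2.5-VL vs GPT-5-mini) on this task?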

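Key Findings

GPT-5-mini achieves the highest overall baseline accuracy (≈91%) with very high confidence (≈99%).
Qwen-72B shows relatively strong baseline performance (≈77% accuracy) with high confidence (≈91%).
Qwen-7B has markedly lower baseline accuracy (≈53%) yet still reports high confidence (≈88%).
In the clarification experiment, GPT-5-mini generates clarification requests on roughly 13% of items and Qwen-72B on roughly 24%, while Qwen-7B almost never requests clarification (<0.1%).
On items for which it requests clarification, GPT-5-mini's accuracy tends to be below its baseline, suggesting that it asks for clarification on harder items; Qwen-72B shows a mixed pattern depending on whether the subset or the full dataset is considered.
Overall, the models show limited alignment between internal uncertainty and clarification behavior, indicating a gap in pragmatic competence even in self-contained reference games.
Human-in-the-loop feedback on clarification requests rarely improves end-to-end performance, and many clarification requests are neither task-relevant nor informative.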
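The gap between confidence and accuracy in the baseline findings can be made concrete with a short calculation. The sketch below uses only the approximate figures reported above; the `overconfidence_gap` function and the `reported` dictionary are illustrative names, and the paper itself may measure this alignment differently.

```python
# Approximate baseline (accuracy, confidence) pairs as reported in this review.
reported = {
    "GPT-5-mini": (0.91, 0.99),
    "Qwen2.5-VL-72B": (0.77, 0.91),
    "Qwen2.5-VL-7B": (0.53, 0.88),
}


def overconfidence_gap(accuracy: float, confidence: float) -> float:
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    return confidence - accuracy


# Round to two decimals because the inputs are themselves approximate.
gaps = {
    name: round(overconfidence_gap(acc, conf), 2)
    for name, (acc, conf) in reported.items()
}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: {gap * 100:.0f} percentage points")
```

On these figures the gaps are about 8, 14, and 35 percentage points respectively. The model with the largest gap, Qwen-7B, is also the one that almost never requests clarification, which matches the limited alignment described above.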

Figure 2: Sankey diagrams for GPT-5-mini (left) and Qwen2.5 VL-72B (right) showing each model’s outcomes in the baseline and clarification experiments. The flow indicates for which baseline items clarification requests were generated and how consistent responses were.


This review was generated by AI and checked by human editors.