QUICK REVIEW

[Paper Review] Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Manar Salamah Ali, Judith Sieker | arXiv (Cornell University) | Jan 12, 2026

Neurobiology of Language and Bilingualism | Citations: 0

One-Sentence Summary

The paper uses color-grid reference games to evaluate whether vision-language models can express internal uncertainty through explicit clarification requests, and finds limited alignment between uncertainty and clarifying behavior across the three models tested.

ABSTRACT

In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

Motivation and Objectives

Motivate reference games as controlled tasks where ambiguity and clarification needs are explicit and measurable.
Evaluate whether vision-language models can recognize their own uncertainty and generate clarifying questions in a reference game setting.
Compare baseline reference resolution performance with a dedicated clarification experiment across multiple models.
Assess whether human-in-the-loop feedback improves model performance in clarification scenarios.

Proposed Method

Use a color-grid reference game dataset (197 games × 60 rounds) in which each round shows three grids (one target and two distractors).
Test three models (Qwen2.5-VL-7B, Qwen2.5-VL-72B, GPT-5-mini) on a baseline task and a clarification task.
Baseline: provide speaker utterance and three grids; model acts as addressee and predicts target.
Clarification experiment: prompt models to ask a clarification question when uncertain; sample responses and compute CR-Rate (the rate of clarification requests), accuracy, and relaxed accuracy.
Use diversity sampling and, where applicable, uncertainty estimates based on maximum softmax probability (MSP) to quantify model confidence.
Optional human-in-the-loop interaction to assess whether clarifications improve end-to-end performance.
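
The two simplest quantities named in the list above, MSP and CR-Rate, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of raw per-candidate scores as input, and the "ends with a question mark" heuristic for detecting a clarification request are all assumptions made for illustration. Relaxed accuracy is left out because its definition is not given in this review.

```python
import math


def max_softmax_probability(scores):
    """Return the maximum softmax probability (MSP) over candidate scores.

    `scores` is assumed to hold one raw score (logit) per candidate grid.
    """
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    return max(exps) / sum(exps)


def looks_like_question(response):
    """Hypothetical heuristic: a response ending in '?' counts as a clarification request."""
    return response.strip().endswith("?")


def cr_rate(responses, is_clarification=looks_like_question):
    """Fraction of responses classified as clarification requests."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if is_clarification(r))
    return hits / len(responses)


if __name__ == "__main__":
    # Three candidate grids with made-up scores; higher MSP means higher confidence.
    print(max_softmax_probability([2.0, 0.5, 0.1]))
    # Four made-up sampled responses, one of which is a question.
    print(cr_rate(["1", "Do you mean the top-left cell?", "3", "2"]))
```

Under these assumptions, an MSP near 1/3 would mean the model is indifferent among the three grids, which is the situation in which a clarification request would be most expected.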

Figure 1: Example item from the dataset. The speaker referred to the second item with the description “Bottom left is bright pink.”

Experimental Results

Research Questions

RQ1: Can vision-language models identify their own uncertainty and translate it into explicit clarification requests in a controlled reference game?
RQ2: How do baseline reference-resolution performance and model confidence relate to the propensity to ask clarifying questions across different difficulty levels (far, split, close)?
RQ3: Do clarification requests improve end-to-end accuracy, and does human-in-the-loop feedback enhance performance?
RQ4: Are there systematic differences in clarification behavior across model families (Qwen2.5-VL vs. GPT-5-mini) in this task?

Key Findings

GPT-5-mini achieves the highest overall baseline accuracy (≈91%) with very high confidence (≈99%).
Qwen2.5-VL-72B shows relatively strong baseline performance (≈77% accuracy) and high confidence (≈91%).
Qwen2.5-VL-7B has substantially lower baseline accuracy (≈53%) but still high confidence (≈88%).
In the clarification experiment, GPT-5-mini generates clarification requests for about 13% of items and Qwen2.5-VL-72B for about 24%, while Qwen2.5-VL-7B almost never requests clarification (<0.1%).
For GPT-5-mini, accuracy on items that yield clarification requests tends to be lower than baseline, suggesting that it asks for clarification on harder items; Qwen2.5-VL-72B shows mixed patterns depending on whether the subset or the full dataset is considered.
Overall, models display limited alignment between internal uncertainty and clarifying behavior, indicating a gap in pragmatic abilities even in self-contained reference games.
Human-in-the-loop responses to clarifications rarely improve end-to-end performance, and many clarifications are not task-relevant or informative.
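
To make the mismatch between confidence and accuracy concrete, the approximate baseline figures listed above can be compared directly. The sketch below is only illustrative arithmetic on the rounded numbers in this review; the "overconfidence gap" (confidence minus accuracy, in percentage points) is a label introduced here, not a metric reported by the paper.

```python
# Approximate baseline figures as listed in the findings above (percent).
reported = {
    "GPT-5-mini": {"accuracy": 91, "confidence": 99},
    "Qwen2.5-VL-72B": {"accuracy": 77, "confidence": 91},
    "Qwen2.5-VL-7B": {"accuracy": 53, "confidence": 88},
}


def overconfidence_gap(stats):
    """Confidence minus accuracy, in percentage points (illustrative only)."""
    return stats["confidence"] - stats["accuracy"]


gaps = {model: overconfidence_gap(stats) for model, stats in reported.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap} points")
# GPT-5-mini: 8 points
# Qwen2.5-VL-72B: 14 points
# Qwen2.5-VL-7B: 35 points
```

On these rounded figures, the smallest model has the largest gap and is also the one that almost never asks for clarification, which is consistent with the limited alignment described above.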

Figure 2: Sankey diagrams for GPT-5-mini (left) and Qwen2.5-VL-72B (right) showing each model’s outcomes in the baseline and clarification experiments. The flows indicate for which baseline items clarification requests were generated and how consistent the responses were.


This review was generated by AI and checked by a human editor.