QUICK REVIEW

[Paper Review] LVLMs and Humans Ground Differently in Referential Communication

Peter Zeng, Weiling Li|arXiv (Cornell University)|Jan 27, 2026

Speech and dialogue systems | Citations: 0

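One-Sentence Summary

This study compares how humans and LVLMs ground referring expressions in a multi-round referential communication task across four director-matcher pairings (human-human, human-AI, AI-human, AI-AI), and finds that humans establish common ground and entrain on concise expressions, whereas LVLMs fail to do so.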

ABSTRACT

For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.

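Motivation and Objectives

Investigate how humans and LVLMs establish common ground in referential communication.
Examine how the grounding process differs across the four pairing configurations (HH, human-AI, AI-human, AI-AI).
Measure accuracy, communicative effort, and lexical entrainment over multiple rounds.
Analyze how model prompting and interaction dynamics affect the grounding behavior of LVLMs.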

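Proposed Method

Run a multi-round referential communication experiment with a factorial design, assigning director and matcher roles under four conditions (HH, human-AI, AI-human, AI-AI).
Collect text dialogues on an online platform (oTree) in which participants describe baskets that have no lexicalized labels.
Use GPT-5.2 as the LVLM in all AI conditions, with carefully designed prompts that enforce the task context and use a fixed JSON format and zero-shot chain-of-thought.
Automatically extract referring expressions (validated against human annotations) and compute grounding measures (accuracy, word and turn counts, lexical overlap).
Analyze round-by-round trends with ordinary least squares regression to assess changes in entrainment and efficiency.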
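The last two steps (grounding measures and round-by-round trends) can be illustrated with a minimal Python sketch. It is not the authors' released tooling. The overlap definition (share of word types in the current expression that also appeared in the previous round), the tokenizer, and the example expressions are all assumptions made for illustration; the review does not state how the paper defines lexical overlap.

```python
import re
from statistics import mean


def tokenize(text):
    """Lowercase and split into simple word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_overlap(previous, current):
    """Share of distinct word types in `current` that also occur in `previous`.

    Assumed definition for illustration; the paper's metric may differ.
    """
    prev_types = set(tokenize(previous))
    curr_types = set(tokenize(current))
    if not curr_types:
        return 0.0
    return len(curr_types & prev_types) / len(curr_types)


def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b*x (simple regression)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical referring expressions for one basket across Rounds 1-4.
rounds = [
    "the tall woven basket with a curved handle and a wide rim",
    "the tall one with the curved handle",
    "tall curved handle",
    "curved handle",
]

word_counts = [len(tokenize(r)) for r in rounds]
overlaps = [lexical_overlap(a, b) for a, b in zip(rounds, rounds[1:])]
slope = ols_slope([1, 2, 3, 4], word_counts)
```

For these made-up expressions the word counts are 12, 7, 3, and 2, so the fitted slope is -3.4 words per round. A negative slope on length combined with high overlap between consecutive rounds is the pattern the review attributes to human-human pairs.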

Figure 1: Repeated referring to two baskets (non-lexicalized objects) by a human-human pair in Rounds 1-4 of our experiment, with lexical overlap highlighted in blue. Entrainment on more concise language (a conceptual pact) occurs by Round 3, after they consider multiple proposals in Rounds 1-2.

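Experimental Results

Research Questions

RQ1 Can LVLMs establish and exploit common ground in a multi-turn referential task the way humans do?
RQ2 How do the director and matcher roles affect grounding success and efficiency between human and AI partners?
RQ3 Do LVLMs show lexical entrainment or reduced communicative effort when collaborating with humans or with other LVLMs?
RQ4 Which mechanisms, such as prompts and reasoning settings, affect LVLMs' ability to ground referents?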

Key Findings

Condition	Accuracy	Words	Turns	RE words	Lexical overlap
HH	4.0**	-74.9***	-4.1**	-36.6***	0.0
AA	-5.3***	10.6	-0.3	-1.8	0.0
AH	-15.3***	-24.7	0.2	-37.3***	-0.1***
HA	-1.1	-129.1	-5.7*	-32.0*	0.0

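Humans achieve high accuracy that rises over rounds (HH goes from about 80% to above 90%), while word and turn counts fall and clear lexical entrainment emerges.
AI-AI pairs start with high accuracy that declines over rounds; they use long, consistent referring expressions and show almost no entrainment.
Human-AI and AI-human pairs start with low accuracy and improve only marginally; AI directors often produce verbose, non-compact descriptions and fail to ground effectively.
AI partners do not adapt to the grounding history, which runs against intuitive Gricean pragmatics and leads to persistent inefficiency and potential misalignment.
In the post-task questionnaire, humans rated AI partners as less competent and less cooperative, highlighting the practical challenges of grounding tasks in human-AI collaboration.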

Figure 2: Trends over four rounds for (from left to right) accuracy (%), number of words, number of turns, number of words in referring expressions, and proportion of lexical overlap by director–matcher condition. Dots show means with 95% CIs, with each color denoting a specific pairing condition.


This review was generated by AI and checked by a human editor.