QUICK REVIEW

[Paper Digest] CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini, Mohamed Chetouani | arXiv (Cornell University) | Feb 9, 2026

Multimodal Machine Learning Applications | Cited by 0

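One-Sentence Summary

CLUE converts the cross-modal attention of a vision-language model (VLM) into an explicit spatial signal that detects referential ambiguity and decides when to ask for clarification in interactive visual grounding, achieving state-of-the-art results on InViG with parameter-efficient LoRA fine-tuning.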

ABSTRACT

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text to image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue

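Motivation and Objectives

Advance Interactive Visual Grounding (IVG) by detecting when an instruction is underspecified for the visual scene.
Convert the VLM's cross-modal attention into an explicit, spatial ambiguity signal.
Develop an ambiguity detector that localizes the regions of confusion in the scene.
Demonstrate end-to-end IVG with clarification dialog guided by ambiguity detection.
Show that parameter-efficient fine-tuning (LoRA) surpasses baselines on real-world IVG data.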

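Proposed Method

Extract text-to-image cross-attention maps from the pretrained VLM decoder.
Train a lightweight convolutional neural network (CNN) on the aggregated attention maps to detect referential ambiguity and localize it spatially.
Fine-tune a Gemma2-based decoder with LoRA adapters for two tasks: ambiguity detection and IVG dialog grounding.
Use a special conditioning token, “clarify”, to steer the model toward either asking a clarification question or emitting location tokens for grounding.
Train end-to-end for IVG on the real-world InViG dataset with InViG-only supervision, and evaluate against state-of-the-art methods.
At inference time, generate a clarification question if ambiguity is detected; otherwise output location tokens for grounding.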
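
The attention-to-decision flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`aggregate_attention`, `entropy_score`, `decide`), the averaging scheme, and the 0.5 threshold are all assumptions made for this example. In particular, the normalized-entropy scorer is only a stand-in for CLUE's trained CNN detector, included so the sketch runs without a deep learning framework.

```python
import math
from typing import Callable, List

Grid = List[List[float]]


def aggregate_attention(attn: List[List[List[float]]], grid_h: int, grid_w: int) -> Grid:
    """Sum attn[head][text_token][image_token] over heads and text tokens,
    normalize to sum to 1, and reshape to a grid_h x grid_w map.
    (Hypothetical aggregation; the paper's exact scheme may differ.)"""
    n_img = grid_h * grid_w
    flat = [0.0] * n_img
    for head in attn:
        for row in head:
            if len(row) != n_img:
                raise ValueError("attention row does not match the image grid size")
            for i, weight in enumerate(row):
                flat[i] += weight
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map has no mass")
    flat = [v / total for v in flat]
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def entropy_score(grid: Grid) -> float:
    """Placeholder ambiguity score in [0, 1]: normalized Shannon entropy.
    Stands in for the trained CNN detector, which is NOT reproduced here."""
    values = [v for row in grid for v in row]
    if len(values) < 2:
        return 0.0
    entropy = -sum(v * math.log(v) for v in values if v > 0)
    return entropy / math.log(len(values))


def decide(grid: Grid,
           scorer: Callable[[Grid], float] = entropy_score,
           threshold: float = 0.5) -> str:
    """Return 'ask' when the score crosses the (assumed) threshold, else 'ground'."""
    return "ask" if scorer(grid) >= threshold else "ground"


# One head, one text token, a 2x2 image grid.
peaked = aggregate_attention([[[1.0, 0.0, 0.0, 0.0]]], 2, 2)
spread = aggregate_attention([[[0.25, 0.25, 0.25, 0.25]]], 2, 2)
print(decide(peaked), decide(spread))
```

In a real pipeline the `scorer` argument would be replaced by the learned detector, and the "ask" branch would condition the decoder to generate a clarification question while the "ground" branch would let it emit location tokens.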

Figure 1: Problem illustration: when an instruction is underspecified, the robot should detect it and ask for clarification (AI generated, then edited)

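Experimental Results

Research Questions

RQ1: Can cross-modal attention maps reliably indicate referential ambiguity in grounded instructions?
RQ2: Does a CNN ambiguity detector based on attention maps outperform heuristic or token-based ambiguity signals?
RQ3: Does a LoRA-fine-tuned VLM achieve competitive IVG performance while remaining parameter-efficient?
RQ4: How well does the ambiguity signal generalize across in-distribution and out-of-distribution (real-world) data?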

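Key Findings

The CNN ambiguity detector based on attention maps performs strongly: Half-Last Detect (CNN) reaches an F1 of 0.846 on Dataset 1 and 0.765 on Dataset 2 (OOD).
Using a half-depth decoder generalizes better: Full-Last Disambig. (AR) drops to 0.702 on the real-world OOD dataset, whereas Half-Full Disambig. (AR) reaches 0.836.
CLUE fine-tuned with InViG-only supervision surpasses the state-of-the-art baseline trained from scratch (TiO) on the IVG task; the Mix-LoRA variant reaches about 75.66% Acc@0.5 on InViG (versus 71.2% for TiO).
Pretraining on object-detection data (the Mix setting) provides a key spatial prior and improves IVG performance over the non-mixed variants.
Zero-shot baselines (Gemma variants) underperform LoRA-fine-tuned CLUE on both simulated and real-world data.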
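
For readers unfamiliar with the metrics quoted above, the sketch below shows how Acc@0.5 and F1 are conventionally computed. It assumes the common definitions (a grounding prediction counts as correct when its IoU with the ground-truth box is at least 0.5, and F1 is taken over binary ambiguous/unambiguous labels); the paper's exact evaluation protocol may differ, and the helper names and box format here are illustrative only.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and the same length")
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(preds)


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Binary F1, treating True as the positive ('ambiguous') class."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: the first box matches exactly, the second overlaps with IoU = 1/3.
pred_boxes = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
true_boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)]
print(acc_at_iou(pred_boxes, true_boxes))
print(f1_score([True, True, False, False], [True, False, True, False]))
```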

Figure 2: Overall CLUE architecture. An RGB image is encoded by SigLIP and projected by an MLP. The text prefix is tokenized and passed with the image tokens into a Gemma2 decoder equipped with LoRA adapters. The decoder both (i) autoregressively generates clarification questions and (ii) exposes cr


This digest was generated by AI and reviewed by human editors.