QUICK REVIEW

[Paper Review] CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini, Mohamed Chetouani | arXiv (Cornell University) | 2026-02-09

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

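CLUE converts a vision-language model's cross-modal attention into an explicit spatial signal that detects referential ambiguity and decides whether to ask a clarification question in Interactive Visual Grounding (IVG). With parameter-efficient LoRA fine-tuning, it achieves state-of-the-art results on InViG.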

ABSTRACT

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text to image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue

Motivation and Objectives

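Motivate Interactive Visual Grounding (IVG) that can detect when an instruction is underspecified for the visual scene.
Convert the VLM's cross-modal attention into an explicit, spatial ambiguity signal.
Develop an ambiguity detector that localizes the confusing regions in the scene.
Demonstrate end-to-end IVG through clarification dialog guided by ambiguity detection.
Show that parameter-efficient fine-tuning (LoRA) is enough to outperform baselines on real-world IVG data.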

Proposed Method

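Extract text-to-image cross-modal attention maps from the pretrained VLM decoder.
Train a lightweight CNN on the aggregated attention maps to detect referential ambiguity and localize it spatially.
Fine-tune the Gemma2-based decoder with LoRA adapters on two tasks: ambiguity detection and IVG dialog grounding.
Use a special condition token, "clarify", to steer the model toward either asking a clarification question or emitting grounding location tokens.
Train end-to-end IVG on the real-world InViG dataset with InViG-only supervision, and evaluate against a state-of-the-art method.
At inference, generate a clarification question if ambiguity is detected; otherwise emit grounding location tokens.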
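
The attention-aggregation and ask-or-ground steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the mean-over-heads-and-text-tokens aggregation, the square patch grid, and the 0.5 threshold are all assumptions made for the example, and the CNN detector is omitted (`ambiguity_prob` stands in for its output).

```python
from typing import List, Sequence

Matrix = List[List[float]]


def aggregate_text_to_image_attention(
    attn: Sequence[Sequence[Sequence[float]]],
    text_idx: Sequence[int],
    image_idx: Sequence[int],
    grid: int,
) -> Matrix:
    """Average attention from text queries to image keys into a grid x grid map.

    attn[h][q][k] is the weight that query token q puts on key token k in head h.
    Mean pooling over heads and text tokens is an assumption of this sketch.
    """
    if len(image_idx) != grid * grid:
        raise ValueError("image_idx must contain exactly grid * grid positions")
    count = len(attn) * len(text_idx)
    if count == 0:
        raise ValueError("need at least one head and one text token")

    flat = [0.0] * len(image_idx)
    for head in attn:
        for q in text_idx:
            for j, k in enumerate(image_idx):
                flat[j] += head[q][k]
    flat = [value / count for value in flat]

    # Reshape the flat patch sequence into rows of the spatial grid.
    return [flat[r * grid:(r + 1) * grid] for r in range(grid)]


def decide_action(ambiguity_prob: float, threshold: float = 0.5) -> str:
    """Gate between asking and grounding (hypothetical threshold and labels)."""
    return "ask" if ambiguity_prob >= threshold else "ground"
```

In this sketch the resulting map would be fed to the detector, and its score passed to `decide_action`, which selects between generating a clarification question and emitting location tokens.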

Figure 1: Problem illustration: when an instruction is underspecified, the robot should detect it and ask for clarification (AI generated, then edited)

Experimental Results

Research Questions

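RQ1: Can the cross-modal attention of a vision-language model reliably indicate referential ambiguity in grounded instructions?
RQ2: Does a CNN ambiguity detector built on attention maps outperform heuristic or token-based ambiguity signals?
RQ3: Can a LoRA fine-tuned VLM achieve competitive IVG performance while remaining parameter-efficient?
RQ4: How well does the ambiguity signal generalize to in-distribution and out-of-distribution (real-world) data?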
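
To make the parameter-efficiency question in RQ3 concrete, here is a small sketch of the standard LoRA formulation, in which a frozen weight W is augmented with a low-rank update scaled by alpha / r. The pure-Python matrices, function names, and the example sizes are illustrative assumptions; the rank and target layers actually used in CLUE are not stated in this review.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), usually initialised to zero
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Number of trainable parameters added by one LoRA adapter."""
    return r * (d_in + d_out)
```

For a hypothetical 1024 x 1024 projection with r = 8, the adapter trains 16,384 parameters, against 1,048,576 for the full matrix.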

Key Results

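The ambiguity detector that applies a CNN to attention maps performs strongly: Half-Last Detect (CNN) achieves an F1 of 0.846 on Dataset 1 and 0.765 on Dataset 2 (OOD).
Using a half-depth decoder generalizes better: Full-Last Disambig. (AR) drops to 0.702 on real-world OOD data, whereas Half-Full Disambig. (AR) reaches 0.836.
CLUE fine-tuned with InViG-only supervision outperforms TiO, a state-of-the-art baseline trained from scratch, on the IVG task; the Mix-LoRA variant reaches about 75.66% Acc@0.5 on InViG, versus 71.2% for TiO.
Pre-tuning on object detection data (the "mix" setting) provides a spatial prior that improves IVG performance over the non-mixed variant.
Zero-shot baselines (Gemma variants) underperform the LoRA fine-tuned CLUE on both simulated and real data.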
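
For readers unfamiliar with the metrics above, the following sketch shows how Acc@0.5 and F1 are commonly computed. It assumes the usual convention that a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5, and that ambiguity detection is scored as binary classification; the paper's exact evaluation protocol may differ, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(gts)


def f1_score(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Binary F1, where label 1 means 'ambiguous'."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```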

Figure 2: Overall CLUE architecture. An RGB image is encoded by SigLIP and projected by an MLP. The text prefix is tokenized and passed with the image tokens into a Gemma2 decoder equipped with LoRA adapters. The decoder both (i) autoregressively generates clarification questions and (ii) exposes cr

This review was generated by AI and reviewed by a human editor.