QUICK REVIEW

[Paper Review] Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman, Md Arifur Rahman|arXiv (Cornell University)|2026. 03. 06.

Multimodal Machine Learning Applications | Citations: 0

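One-Line Summary

This paper shows that step-level visual grounding faithfulness, measured by the Step Grounding Rate (SGR), strongly predicts OOD generalization in long-horizon vision-language models, and it suggests that grounding quality is an independent capability axis not explained by accuracy or scale.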

ABSTRACT

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).

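Motivation and Objectives

Define behavioral faithfulness over long horizons as the degree to which a model's intermediate beliefs remain consistent with the evolving visual input.
Quantify step-level grounding and characterize its relationship to robustness on OOD data.
Demonstrate that grounding quality varies independently of model size and in-distribution accuracy.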

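Proposed Method

Proposes a four-stage framework for evaluating behavioral faithfulness: reasoning extraction, grounding verification, belief tracking, and controlled perturbation.
Computes the Step Grounding Rate (SGR) as the fraction of reasoning steps that have supporting visual evidence.
Computes the Temporal Consistency Score (TCS) to measure belief stability over time.
Computes the Hallucination Rate (HR) and Visual Reliance Score (VRS) to quantify grounding quality and reliance on the visual input.
Validates grounding using counterfactual traces and multiple verifiers (Faster R-CNN, tracking, action recognition, and an alternative detector).
Evaluates eight models across three long-horizon benchmarks (STAR, R2R, TEACh) and their OOD splits.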
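
The SGR definition above (the fraction of reasoning steps with supporting visual evidence) can be illustrated with a minimal Python sketch. The `Step` schema, the boolean `grounded` verdict, and the example trace are assumptions made for illustration; the review does not describe how the paper's verifiers encode their output, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One extracted reasoning step plus a verifier verdict (hypothetical schema)."""
    text: str
    grounded: bool  # True if a verifier found supporting visual evidence


def step_grounding_rate(steps: List[Step]) -> float:
    """Return the fraction of reasoning steps marked as visually grounded."""
    if not steps:
        raise ValueError("trace has no reasoning steps")
    return sum(s.grounded for s in steps) / len(steps)


# Made-up trace for illustration only; not data from the paper.
trace = [
    Step("The person picks up the cup.", True),
    Step("The cup is placed on the table.", True),
    Step("The person opens the fridge.", False),
    Step("The person walks to the door.", True),
]
sgr = step_grounding_rate(trace)
print(f"SGR = {sgr:.2f}")  # 3 of 4 steps grounded -> SGR = 0.75
```

In practice the `grounded` flag would come from the verification stage (for example, an object detector or tracker checking each step against the frames), which this sketch leaves out.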

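Experimental Results

Research Questions

RQ1: Does temporal grounding faithfulness predict OOD retention in long-horizon VLMs?
RQ2: Is grounding quality an independent axis of model capability beyond accuracy and scale?
RQ3: How robust is SGR to prompts, verifiers, and perturbations, and does it reflect genuine visual reliance?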

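Key Findings

SGR predicts OOD retention across benchmarks and models with r = 0.83 (permutation test, p = 0.003).
Even among capacity-matched (7B) models with similar accuracy, SGR differs by up to 10.8 percentage points, indicating that grounding is an independent axis.
Higher temporal consistency (TCS) correlates with higher accuracy (r = 0.87).
SGR remains predictive of OOD performance even after conditioning on accuracy (partial correlation r = 0.68, p < 0.05).
The perturbation analysis shows |ΔSGR| > |ΔAcc| for every perturbation, suggesting that grounding is more sensitive to visual changes than the final answer is.
Counterfactual and cross-architecture verification confirm that SGR captures genuine visual reliance (e.g., random reasoning yields SGR ≈ 18%).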
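
The statistics reported above (a Pearson correlation with a permutation-test p-value, and a partial correlation that controls for accuracy) can be sketched with NumPy as follows. The per-model numbers are synthetic values invented for illustration, not the paper's measurements, and the helper names `pearson_r`, `permutation_p_value`, and `partial_r` are hypothetical. The exact test procedure used by the authors is not described in this review, so this shows one common way to run such an analysis.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test: shuffle y and count |r| at least as large as observed."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = abs(pearson_r(x, y))
    hits = sum(abs(pearson_r(x, rng.permutation(y))) >= observed
               for _ in range(n_perm))
    # The +1 correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


def partial_r(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])  # intercept + covariate
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return pearson_r(res_x, res_y)


# Synthetic values for eight hypothetical models (NOT the paper's data).
sgr = [52.0, 55.5, 58.0, 61.0, 63.5, 66.0, 70.0, 73.5]
ood_retention = [60.0, 66.0, 63.0, 70.0, 69.0, 75.0, 74.0, 81.0]
accuracy = [64.0, 66.0, 63.0, 67.0, 70.0, 68.0, 71.0, 73.0]

r = pearson_r(sgr, ood_retention)
p = permutation_p_value(sgr, ood_retention)
r_partial = partial_r(sgr, ood_retention, accuracy)
print(f"r = {r:.2f}, permutation p = {p:.4f}, partial r = {r_partial:.2f}")
```

A permutation test is a natural choice with only eight models, since it does not rely on the normality assumptions behind the standard t-based p-value for Pearson's r.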


This review was generated by AI and reviewed by a human editor.