QUICK REVIEW

[Paper Digest] Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman, Md Arifur Rahman|arXiv (Cornell University)|Mar 6, 2026

Multimodal Machine Learning Applications | Cited by 0

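One-Sentence Summary

The paper shows that step-level visual grounding faithfulness, measured by the Step Grounding Rate (SGR), strongly predicts the out-of-distribution (OOD) generalization of long-horizon vision-language models, and that grounding quality is an independent capability axis beyond accuracy and scale.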

ABSTRACT

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).

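Motivation and Objectives

Define long-horizon behavioral faithfulness as the degree to which a model's intermediate beliefs remain consistent with the evolving visual input.
Quantify step-level grounding and its relationship to robustness on out-of-distribution data.
Show that grounding quality varies independently of model scale and in-distribution accuracy.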

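Proposed Method

Propose a four-stage framework for evaluating behavioral faithfulness: reasoning extraction, grounding verification, belief tracking, and controlled perturbation.
Compute the Step Grounding Rate (SGR) as the fraction of reasoning steps that have supporting visual grounding.
Compute the Temporal Consistency Score (TCS) to measure the stability of beliefs over time.
Compute the Hallucination Rate (HR) and the Visual Reliance Score (VRS) to quantify grounding quality and the degree of visual reliance.
Validate grounding with counterfactual traces and multiple verifiers (Faster R-CNN, tracking, action recognition, and an alternative detector).
Evaluate eight models on three long-horizon benchmarks (STAR, R2R, TEACh) and their OOD splits.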
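The SGR definition above (the fraction of reasoning steps with supporting visual grounding) can be illustrated with a minimal Python sketch. The `ReasoningStep` type, its `grounded` flag, and the example trace are hypothetical: the digest does not describe how the paper's verifiers reach a per-step verdict, so the sketch assumes that verdict has already been computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReasoningStep:
    """One extracted reasoning step (hypothetical structure, not from the paper)."""
    text: str
    grounded: bool  # assumed verdict of some external grounding verifier


def step_grounding_rate(steps: Sequence[ReasoningStep]) -> float:
    """Return the fraction of steps marked as visually grounded."""
    if not steps:
        raise ValueError("trace has no reasoning steps")
    return sum(1 for s in steps if s.grounded) / len(steps)


# Illustrative trace; the step texts and verdicts are made up.
trace = [
    ReasoningStep("The person picks up a mug from the counter.", True),
    ReasoningStep("They walk to the sink.", True),
    ReasoningStep("A dog is sitting near the door.", False),
    ReasoningStep("They turn on the tap.", True),
]

sgr = step_grounding_rate(trace)
print(f"SGR = {sgr:.2f}")
```

Averaging this per-trace value over a benchmark would give one SGR figure per model, although the exact aggregation used in the paper is not stated in this digest.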

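Experimental Results

Research Questions

RQ1: Does temporal grounding faithfulness predict out-of-distribution retention in long-horizon VLMs?
RQ2: Is grounding quality an independent capability axis beyond accuracy and scale?
RQ3: How robust is SGR to prompts, verifiers, and perturbations, and does it reflect genuine reliance on the visual input?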

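Key Findings

SGR predicts OOD retention across benchmarks and models with r = 0.83 (p = 0.003).
Among capacity-matched models, SGR varies by up to 10.8 percentage points despite similar accuracy, indicating that grounding is an independent capability axis.
Higher temporal consistency (TCS) correlates with higher accuracy (r = 0.87).
Conditioned on accuracy, SGR remains predictive of OOD performance (partial correlation r = 0.68, p < 0.05).
Perturbation analysis shows |ΔSGR| > |ΔAcc| for every perturbation, indicating that grounding is more sensitive to visual changes than final-answer accuracy is.
Counterfactual and cross-architecture validation confirms that SGR captures genuine visual reliance (for example, SGR is about 18% for random reasoning).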
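The statistics quoted in these findings (a Pearson correlation with a permutation-test p-value, and a partial correlation that conditions on accuracy) can be sketched in plain Python as below. This is a generic illustration of those standard tests, not the authors' analysis code. The three lists hold made-up numbers for eight hypothetical models and are not the paper's data; the two-sided test and the add-one p-value correction are also assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Made-up values for eight hypothetical models (NOT the paper's data).
sgr = [0.42, 0.48, 0.51, 0.55, 0.60, 0.63, 0.70, 0.74]
ood_retention = [0.58, 0.66, 0.61, 0.70, 0.69, 0.77, 0.80, 0.79]
accuracy = [0.61, 0.64, 0.60, 0.66, 0.70, 0.65, 0.72, 0.71]

r = pearson_r(sgr, ood_retention)
p = permutation_p_value(sgr, ood_retention)
r_partial = partial_r(sgr, ood_retention, accuracy)
print(f"r = {r:.2f}, permutation p = {p:.4f}, partial r | accuracy = {r_partial:.2f}")
```

With only eight models, a permutation test is a natural choice because it does not rely on the normality assumption behind the usual t-based p-value for r.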

This digest was generated by AI and reviewed by human editors.