QUICK REVIEW

[Paper Review] Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman, Md Arifur Rahman|arXiv (Cornell University)|Mar 6, 2026

Multimodal Machine Learning Applications | Citations: 0

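One-Line Summary

Summary: This study shows that step-level visual grounding faithfulness, measured by the Step Grounding Rate (SGR), strongly predicts OOD generalization in long-horizon vision-language models, and that grounding quality is an independent capability axis beyond accuracy and scale.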

ABSTRACT

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).

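Motivation and Objectives

Define long-horizon behavioral faithfulness as the degree to which a model's intermediate beliefs remain consistent with the evolving visual input.
Quantify step-level grounding and how it relates to robustness on OOD data.
Show that grounding quality varies as an axis independent of model size and in-distribution accuracy.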

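Proposed Method

Propose a four-stage framework for evaluating behavioral faithfulness: reasoning extraction, grounding verification, belief tracking, and controlled perturbation.
Compute the Step Grounding Rate (SGR) as the fraction of reasoning steps whose visual grounding is supported.
Compute the Temporal Consistency Score (TCS) to measure belief stability over time.
Compute the Hallucination Rate (HR) and the Visual Reliance Score (VRS) to quantify grounding quality and visual reliance.
Validate grounding using counterfactual traces and multiple verifiers (Faster R-CNN, tracking, action recognition, and alternative detectors).
Evaluate eight models on three long-horizon benchmarks (STAR, R2R, and TEACh) and their respective OOD splits.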
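The SGR definition above (the fraction of reasoning steps with supported visual grounding) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ReasoningStep` class, the `grounded` flag, and the example trace are hypothetical, and the sketch assumes a verifier has already labeled each step as supported or not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningStep:
    """One extracted reasoning step plus a verifier verdict (hypothetical structure)."""
    text: str
    grounded: bool  # True if a verifier found visual support for this step


def step_grounding_rate(steps: List[ReasoningStep]) -> float:
    """Return the fraction of reasoning steps with supported visual grounding."""
    if not steps:
        raise ValueError("SGR is undefined for an empty reasoning trace")
    return sum(step.grounded for step in steps) / len(steps)


# Illustrative trace only; these steps and verdicts are made up.
trace = [
    ReasoningStep("The person picks up the mug", grounded=True),
    ReasoningStep("The mug is placed in the sink", grounded=True),
    ReasoningStep("The faucet is turned on", grounded=False),
    ReasoningStep("The person walks to the fridge", grounded=True),
]
sgr = step_grounding_rate(trace)
print(f"SGR = {sgr:.2f}")  # 3 of 4 steps are grounded
```

How the per-step verdict is produced (for example, by matching a step against detector or tracker output) is left out here, since the review does not describe it in detail.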

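Experimental Results

Research Questions

RQ1: Does temporal grounding fidelity predict OOD retention in long-horizon VLMs?
RQ2: Is grounding quality an independent capability axis beyond accuracy and scale?
RQ3: How robust is SGR to prompts, verifiers, and perturbations, and does it truly reflect visual reliance?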

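Key Findings

SGR predicts OOD retention across benchmarks and models with r = 0.83 (p = 0.003).
Within capacity-matched models, SGR varies by up to 10.8 percentage points despite similar accuracy, indicating that grounding is an independent axis.
Higher temporal consistency (TCS) correlates with higher accuracy (r = 0.87).
Even after controlling for accuracy, SGR still predicts OOD performance (partial correlation r = 0.68, p < 0.05).
In the perturbation analysis, |ΔSGR| > |ΔAcc| holds for every perturbation, indicating that grounding is more sensitive to visual changes than the final answer.
Counterfactual and cross-architecture verification confirm that SGR captures genuine visual reliance (e.g., random reasoning yields SGR ≈ 18%).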
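The statistics reported above (a Pearson correlation with a permutation-test p-value, and a partial correlation controlling for accuracy) can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, the two-sided test and the first-order partial-correlation formula are standard textbook choices that the paper may or may not use, and the eight data points are synthetic, not the paper's results.

```python
import math
import random
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation p-value: how often shuffled y correlates as strongly with x."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


def partial_r(x, y, z) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Synthetic values for eight hypothetical models (NOT taken from the paper).
sgr = [0.42, 0.48, 0.51, 0.55, 0.60, 0.63, 0.68, 0.74]
ood_retention = [0.55, 0.61, 0.58, 0.66, 0.70, 0.69, 0.77, 0.81]
id_accuracy = [0.61, 0.64, 0.60, 0.66, 0.63, 0.68, 0.65, 0.70]

r = pearson_r(sgr, ood_retention)
p = permutation_p_value(sgr, ood_retention)
r_partial = partial_r(sgr, ood_retention, id_accuracy)
print(f"r = {r:.2f}, permutation p = {p:.4f}, partial r given accuracy = {r_partial:.2f}")
```

With only eight models, a permutation test is a natural fit because it does not rely on a normality assumption; a partial correlation that stays high after controlling for in-distribution accuracy is what would support the claim that grounding carries information beyond accuracy.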

This review was generated by AI and checked by a human editor.