QUICK REVIEW

[Paper Review] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Dhita Putri Pratama, Soyeon Caren Han | arXiv (Cornell University) | Feb 24, 2026

Multimodal Machine Learning Applications | Citations: 0

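One-Line Summary

This paper proposes Vision-Language Causal Graphs (VLCGs) and the diagnostic benchmark ViLCaR to evaluate causal attribution and causal inference. Structured relevance guidance does not necessarily raise final-answer accuracy, but it does improve attribution and inference consistency.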

ABSTRACT

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments in state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

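Motivation and Objectives

Motivate the need to diagnose causal reasoning in LVLMs, beyond the correctness of final answers.
Introduce VLCGs as a structured, query-conditioned representation of the causally relevant elements of a given image-question pair.
Build ViLCaR, a diagnostic benchmark comprising Causal Attribution, Causal Inference, and Question Answering tasks.
Develop graph-aligned evaluation metrics that separate relevance identification from final-answer accuracy.
Show that structured causal guidance improves attribution and inference consistency in LVLMs.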

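Proposed Method

Define a VLCG as a directed graph G=(V,E,A) that connects objects, attributes, and relations (V) through causal dependencies (E) and explicit scene-grounded assumptions (A).
Construct ViLCaR from VQA/VCR data through causal filtering, VLCG generation by LVLM prompting, grounding with an independent detector, minimal causal pruning, and human quality control.
Evaluate LVLMs on three diagnostic tasks, CA (identifying causally relevant attributes), CI (consistency of reasoning with the VLCG), and QA accuracy, under zero-shot, standard ICL, and VLCG-Augmented Prompting settings.
Use graph-aligned metrics for CA and CI to measure causal identification and reasoning consistency in addition to QA accuracy.
Provide an LLM-based evaluator protocol that compares generated reasoning against the causal assumptions of the gold-standard VLCG.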
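The review does not state how the graph-aligned CA metric is computed. As a rough illustration only, one plausible stand-in is set-overlap precision, recall, and F1 between the elements a model names as causally relevant and the nodes of the gold VLCG. The function name `attribution_f1`, the lowercase normalization, and the example labels are assumptions made for this sketch, not the paper's definition.

```python
def attribution_f1(predicted, gold):
    """Set-overlap precision/recall/F1 between predicted and gold node labels.

    Hypothetical stand-in for a graph-aligned CA score; the metric used in
    the paper may be defined differently.
    """
    # Normalize labels so that "Person" and "person" count as the same node.
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels only: gold nodes borrowed from the wedding example,
# model output invented for the sketch.
gold_nodes = ["person", "cake", "wedding dress", "suit"]
model_nodes = ["Person", "cake", "balloon"]
precision, recall, f1 = attribution_f1(model_nodes, gold_nodes)
```

A score of this kind rewards naming the right evidence independently of whether the final answer is correct, which is the separation the benchmark aims for.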

Figure 1. Example of a VLCG. Given an image-question pair (“Have these people just married?”), the graph encodes causally relevant objects (e.g., persons, cake), attributes (wedding dress, suit), relations (wear), and scene-grounded assumptions linking visual evidence to the conclusion. Unlike scen
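To make the G=(V,E,A) structure concrete, the wedding example could be held in a small container like the one below. This is a minimal sketch: the class name `VLCG`, its field names, the edge wiring, and the wording of the assumption are all hypothetical, since the caption lists the node types but not the actual edges or assumption text.

```python
from dataclasses import dataclass, field


@dataclass
class VLCG:
    """Illustrative container for G = (V, E, A); all names are assumptions."""
    question: str
    nodes: dict = field(default_factory=dict)        # V: label -> kind
    edges: list = field(default_factory=list)        # E: directed (source, target)
    assumptions: list = field(default_factory=list)  # A: scene-grounded text

    def add_node(self, label, kind):
        self.nodes[label] = kind

    def add_edge(self, source, target):
        # Reject edges that mention labels missing from V.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError(f"unknown node in edge: {source!r} -> {target!r}")
        self.edges.append((source, target))

    def parents(self, label):
        """Labels with a directed edge into `label`."""
        return [s for s, t in self.edges if t == label]


graph = VLCG(question="Have these people just married?")
for label, kind in [
    ("person", "object"),
    ("cake", "object"),
    ("wedding dress", "attribute"),
    ("suit", "attribute"),
    ("wear", "relation"),
]:
    graph.add_node(label, kind)

# Hypothetical wiring; the real edges in the paper's figure are not given here.
graph.add_edge("person", "wear")
graph.add_edge("wear", "wedding dress")
graph.add_edge("wear", "suit")

# Hypothetical assumption text linking the visual evidence to the conclusion.
graph.assumptions.append(
    "People wearing a wedding dress and a suit beside a cake are likely at a wedding."
)
```

Keeping the assumptions as a separate field, rather than as extra nodes, mirrors the three-part definition and makes them easy to compare against generated reasoning.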

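Experimental Results

Research Questions

RQ1: Can LVLMs correctly identify the causally relevant attributes for a given image-question pair (CA)?
RQ2: Are reasoning chains built on the attributes and assumptions identified in the VLCG consistent and coherent (CI)?
RQ3: Does injecting VLCG structured relevance improve final-answer accuracy or reasoning quality (QA)?
RQ4: Do structured causal graphs act as a useful prior that constrains reasoning and reduces reliance on spurious cues?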

Key Findings

Setting	CA	CI	QA Accuracy	BLEU (reasoning)	ROUGE (reasoning)
Zero-shot	0.458	0.652	0.763	0.164	0.266
Standard ICL	0.455	0.654	0.763	0.163	0.264
VLCG (Best)	0.488	0.690	0.768	0.177	0.273

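VLCG-Augmented Prompting improves CA from 0.458 to 0.488 (+6.6% relative).
VLCG-Augmented Prompting improves CI from 0.652 to 0.690 (+5.8% relative).
QA accuracy is nearly unchanged under VLCG prompting, moving from 0.763 to 0.768.
Standard ICL is essentially flat relative to zero-shot on CA and CI (0.455 vs. 0.458 and 0.654 vs. 0.652), whereas VLCG guidance makes causal reasoning more stable.
BLEU and ROUGE show only modest gains (0.164 to 0.177 and 0.266 to 0.273), suggesting that the improvement comes from structured relevance rather than lexical overlap.
Structured causal guidance acts as a relevance prior, anchoring reasoning to causally meaningful variables.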
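The relative changes quoted above can be rechecked from the table with a few lines of arithmetic. The sketch below copies the CA, CI, and QA values from the table and takes zero-shot as the baseline; the helper name `relative_gain` and the dictionary layout are choices made here for illustration.

```python
# Scores copied from the results table (CA, CI, QA accuracy).
results = {
    "Zero-shot":    {"CA": 0.458, "CI": 0.652, "QA": 0.763},
    "Standard ICL": {"CA": 0.455, "CI": 0.654, "QA": 0.763},
    "VLCG (Best)":  {"CA": 0.488, "CI": 0.690, "QA": 0.768},
}


def relative_gain(baseline, value):
    """Relative change of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = results["Zero-shot"]
gains = {
    setting: {
        metric: round(relative_gain(baseline[metric], score), 1)
        for metric, score in scores.items()
    }
    for setting, scores in results.items()
    if setting != "Zero-shot"
}

for setting, per_metric in gains.items():
    print(setting, per_metric)
```

Run this way, the VLCG row reproduces the +6.6% (CA) and +5.8% (CI) figures, while the QA change stays below one percent and standard ICL stays within a fraction of a percent of zero-shot on every metric.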

Figure 2. Three diagnostic tasks in ViLCaR derived from the verified and pruned VLCGs: CA, CI, and QA.


This review was generated by AI and checked by a human editor.