QUICK REVIEW

[Paper Review] Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications|Citations: 0

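One-line summary

This paper defines Spatial Causal Prediction (SCP) and builds SCP-Bench, which provides 2,500 QA pairs over 1,181 videos to evaluate spatial causal reasoning about unseen past or future states, beyond what is observed. It also analyzes model gaps and improvement strategies.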

ABSTRACT

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

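Motivation and objectives

Formulate a new task, spatial causal reasoning that goes beyond visible spatio-temporal understanding.
Build and release SCP-Bench to systematically evaluate perception, reasoning, and prediction of spatial dynamics.
Benchmark 23 state-of-the-art models and identify the gap between human and machine spatial causal intelligence.
Analyze the factors that affect SCP performance and propose improvement strategies.
Provide insights on scaling, perception enhancement, and causal scaffolding for improving SCP capability.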

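Proposed method

Formalize spatial causal reasoning beyond visible spatio-temporal understanding (SCP) as a QA task.
Build SCP-Bench through diverse video collection, semi-automatic QA annotation, and verification of the cut points that separate the visible portion of each video from the unseen portion.
Define 8 spatial reasoning categories spanning two causal directions (backward, forward) and two viewpoints (single-view, multi-view).
Broadly evaluate commercial, open-source, and spatially specialized models across multiple SCP tasks and scene types.
Run rigorous ablations (Gold Video vs. captions) to separate perception from reasoning, and test temporal robustness with single-frame vs. multi-frame input.
Analyze the effects of model scale, perception enhancement (dense captions, spatial interaction graphs), and external causal scaffolds (textual future prediction, world models).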
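
The cut-point idea above can be sketched in a few lines. This is only an illustration, not the benchmark's actual data format: the `SCPItem` class, its field names, and the convention that forward items show the frames before the cut while backward items show the frames after it are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SCPItem:
    """Hypothetical container for one SCP-style QA item (all field names are assumptions)."""
    frames: Sequence[str]  # ordered frame identifiers for one video
    cut_index: int         # frames[:cut_index] come before the cut point
    direction: str         # "forward" or "backward"
    question: str
    options: List[str]
    answer: str


def split_at_cut(item: SCPItem) -> Tuple[Sequence[str], Sequence[str]]:
    """Return (visible, hidden) frames under the assumed convention.

    forward:  the model sees the frames before the cut and predicts what follows.
    backward: the model sees the frames after the cut and infers what came before.
    """
    if not 0 < item.cut_index < len(item.frames):
        raise ValueError("cut_index must leave both segments non-empty")
    before = item.frames[:item.cut_index]
    after = item.frames[item.cut_index:]
    if item.direction == "forward":
        return before, after
    if item.direction == "backward":
        return after, before
    raise ValueError(f"unknown direction: {item.direction!r}")
```

Validating that the cut leaves both segments non-empty mirrors the cut-point verification step described above: a question is only meaningful if part of the video is genuinely withheld.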

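Experimental results

Research questions

RQ1 How well do current multimodal LLMs perform on SCP-Bench across diverse scenes and viewpoints?
RQ2 Which factor limits SCP performance the most: perception versus reasoning, temporal horizon, or causal structure?
RQ3 Can scaling up model size and causal scaffolding improve SCP, and which strategy is most effective?
RQ4 Are multi-view and forward-prediction tasks harder than single-view and backward-inference tasks?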

Key findings

Model	Avg.	Appearance Order	Counting	Planning	Relation	Relative Distance	Relative Size	Relative Speed	Spatial State
Human Performance	89.61	97.60	81.20	92.26	85.70	86.70	97.62	91.61	84.17
GPT-5 (Closed)	66.24	79.04	58.12	59.06	64.07	70.48	95.24	77.42	65.11
Gemini 2.5 Pro (Closed)	55.84	69.28	54.87	52.76	46.20	63.47	88.10	67.10	62.41
Gemini 2.5 Flash (Closed)	52.10	59.28	52.14	51.74	43.14	57.75	88.10	66.45	55.60
Claude Sonnet 4.5 (Closed)	56.14	68.86	52.14	57.43	45.65	60.90	80.95	68.39	63.90
Qwen3-VL-2B (Open)	43.04	41.92	42.74	45.01	40.85	44.41	59.52	47.10	40.65
Qwen3-VL-8B (Open)	47.52	54.49	51.28	49.29	42.33	49.47	90.48	46.45	46.40
Qwen3-VL-30B-A3B (Open)	54.16	65.27	52.14	54.79	46.22	56.65	85.71	66.45	57.19
Qwen3-VL-32B (Open)	56.84	59.88	51.28	58.66	52.63	57.98	90.48	67.10	55.04
Qwen3-VL-235B-A22B (Open)	61.04	67.07	54.70	60.90	55.03	63.03	97.62	74.84	63.31
Qwen3-Omni-30B-A3B (Open)	53.60	63.47	55.56	53.56	47.03	53.72	88.10	65.81	55.40
InternVL3.5-8B (Open)	50.52	59.88	54.70	54.79	43.82	54.52	61.90	58.71	44.96
InternVL3.5-38B (Open)	53.56	62.28	53.85	56.01	46.34	57.98	90.48	65.81	48.20
InternVL3.5-241B-A28B (Open)	56.96	67.07	60.68	61.10	46.11	60.37	90.48	68.39	60.07
MiniCPM-V-4.5 (Open)	43.80	53.29	49.57	43.99	36.04	49.20	76.19	52.26	42.81
DeepSeek-VL2 (Open)	38.08	45.51	38.46	39.51	29.41	45.74	73.81	53.55	33.81
NVILA-8B (Open)	34.40	36.53	36.75	38.09	30.66	30.05	59.52	38.71	37.05
NVILA-15B (Open)	45.28	54.49	45.30	48.07	35.35	52.13	73.81	50.97	49.28
LLaVA-OneVision-7B (Open)	36.48	42.51	37.61	37.07	31.24	38.30	64.29	46.45	35.61
LLaVA-OneVision-70B (Open)	50.84	64.67	52.99	48.68	44.39	53.46	78.57	61.94	51.80
LLaVA-OneVision-1.5-8B (Open)	45.52	56.29	47.01	46.44	39.13	50.27	80.95	51.61	41.73
LLaVA-NeXT-Video-7B (Open)	36.60	43.11	25.64	35.44	29.52	48.40	54.76	54.84	32.73
Spatial-MLLM (Spatial Model)	39.76	45.51	28.21	33.81	38.33	49.73	66.67	50.97	32.37
SpaceR (Spatial Model)	41.36	52.10	34.19	40.53	34.90	45.21	59.52	54.19	44.60
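
To make the human-model gap in the table concrete, the sketch below subtracts model accuracy from human accuracy, both for the overall average and per category. The numbers are copied from the table rows for Human Performance, GPT-5, Qwen3-VL-235B-A22B, and Claude Sonnet 4.5; the variable and function names are illustrative assumptions, not part of the paper.

```python
# Accuracy values (%) copied from the results table above.
CATEGORIES = [
    "Appearance Order", "Counting", "Planning", "Relation",
    "Relative Distance", "Relative Size", "Relative Speed", "Spatial State",
]
HUMAN_AVG = 89.61
HUMAN = [97.60, 81.20, 92.26, 85.70, 86.70, 97.62, 91.61, 84.17]
GPT5 = [79.04, 58.12, 59.06, 64.07, 70.48, 95.24, 77.42, 65.11]
MODEL_AVG = {
    "GPT-5": 66.24,
    "Qwen3-VL-235B-A22B": 61.04,
    "Claude Sonnet 4.5": 56.14,
}


def gap_to_human(human: float, model: float) -> float:
    """Gap in percentage points, rounded to match the table's precision."""
    return round(human - model, 2)


# Overall gap for a few representative models.
avg_gaps = {name: gap_to_human(HUMAN_AVG, acc) for name, acc in MODEL_AVG.items()}

# Per-category gap for the best-scoring model (GPT-5).
category_gaps = {
    cat: gap_to_human(h, m) for cat, h, m in zip(CATEGORIES, HUMAN, GPT5)
}
widest = max(category_gaps, key=category_gaps.get)
narrowest = min(category_gaps, key=category_gaps.get)

print(avg_gaps)
print("widest gap:", widest, category_gaps[widest])
print("narrowest gap:", narrowest, category_gaps[narrowest])
```

On these rows, the best model trails the human average by 23.37 points, and GPT-5's per-category gap ranges from 2.38 points (Relative Size) to 33.20 points (Planning).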

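Models fall far short of human level on SCP-Bench (best accuracy 66.24%, by GPT-5, vs. a human average of 89.61%).
Large open-source models can match or exceed some closed models on specific SCP tasks, which points to the value of scaling and the competitiveness of open models.
Relative size, relative speed, and spatial state are comparatively easy categories. Object relation, planning, and counting are harder and require higher-order reasoning.
Future-oriented prediction remains harder than backward inference, and accuracy improves only marginally across temporal extrapolation horizons, staying in roughly the mid-to-upper 40% range at each horizon.
Perception is not the only bottleneck; reasoning about unobserved spatial states is the core constraint. Even when perception is improved with Gold Video, reasoning remains difficult.
Increasing model size yields consistent performance gains. Simple CoT/self-thinking gives limited or inconsistent improvement, and perception enhancement yields limited gains.
Causal scaffolds for unobserved spatial states (especially textual descriptions of the future) may improve performance significantly more than image- or video-based scaffolds.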
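
The scaling finding can be checked directly against the Avg. column of the table. The sketch below groups the open models by family and tests whether average accuracy rises with each larger variant. The values are copied from the table; ordering variants by the parameter count in their names (including the mixture-of-experts variants, whose active parameter counts are smaller) is an assumption of this example.

```python
# Avg. accuracy (%) per variant, copied from the results table.
# Variants are ordered by the total parameter count in their names (an assumption).
FAMILIES = {
    "Qwen3-VL": [("2B", 43.04), ("8B", 47.52), ("30B-A3B", 54.16),
                 ("32B", 56.84), ("235B-A22B", 61.04)],
    "InternVL3.5": [("8B", 50.52), ("38B", 53.56), ("241B-A28B", 56.96)],
    "NVILA": [("8B", 34.40), ("15B", 45.28)],
    "LLaVA-OneVision": [("7B", 36.48), ("70B", 50.84)],
}


def is_strictly_increasing(scores):
    """True if every score is higher than the one before it."""
    return all(a < b for a, b in zip(scores, scores[1:]))


def scaling_summary(families):
    """Per family: whether accuracy rises with size, and the smallest-to-largest gain."""
    summary = {}
    for family, variants in families.items():
        scores = [acc for _, acc in variants]
        summary[family] = {
            "monotonic": is_strictly_increasing(scores),
            "gain": round(scores[-1] - scores[0], 2),
        }
    return summary


for family, stats in scaling_summary(FAMILIES).items():
    print(family, stats)
```

All four families listed here improve with each larger variant, which is consistent with the reported scaling trend. The check covers only the average column and says nothing about individual categories.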


This review was generated by AI and checked by a human editor.