QUICK REVIEW

[Paper Digest] Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Cited by 0

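One-Sentence Summary

This paper defines Spatial Causal Prediction (SCP) and builds SCP-Bench, which contains 2,500 QA pairs across 1,181 videos, to evaluate spatial causal reasoning about unobserved past and future states and to analyze model gaps and improvement strategies.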

ABSTRACT

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

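Motivation and Goals

Propose a new task for spatial causal reasoning beyond visible spatio-temporal understanding.
Build and release SCP-Bench to systematically evaluate perception of, reasoning about, and prediction of spatial dynamics.
Benchmark 23 state-of-the-art models and expose the gap between humans and machines in spatial causal intelligence.
Analyze the factors that affect SCP performance and propose improvement strategies.
Provide insights on scaling, perception enhancement, and causal scaffolding for advancing SCP capabilities.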

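Proposed Method

Task formalization: Spatial Causal Prediction (SCP) is posed as a question-answering task with partial temporal context.
Build SCP-Bench by collecting diverse videos, annotating QA pairs semi-automatically, and validating the split points that separate the visible and unseen parts.
Define 8 spatial reasoning categories across two causal directions (backward, forward) and two viewpoint settings (single-view, multi-view).
Evaluate a broad set of models (proprietary, open-source, and spatially specialized) across SCP tasks and scene types.
Run controlled ablations to separate perception from reasoning (Gold Video vs. captions) and to test temporal robustness (single frame vs. multiple frames).
Analyze the effects of model scale, perception enhancement (dense captions, spatial interaction graphs), and external causal scaffolds (textual future descriptions, world models).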
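
To make the task setup concrete, the sketch below shows one way an SCP sample with a split point could be represented. It is an illustration only, not the benchmark's actual data format: the class `SCPSample`, its field names, and the helper `visible_and_hidden` are all hypothetical. It also assumes that a forward question shows the segment before the split and hides the rest, and that a backward question does the opposite.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SCPSample:
    """Hypothetical container for one SCP question (not the official schema)."""
    frames: List[str]   # ordered frame identifiers for the full clip
    split_index: int    # assumed split point between the two segments
    direction: str      # "forward" (predict future) or "backward" (infer past)
    question: str
    options: List[str]
    answer: str


def visible_and_hidden(sample: SCPSample) -> Tuple[List[str], List[str]]:
    """Return (frames shown to the model, frames withheld from it)."""
    head = sample.frames[:sample.split_index]
    tail = sample.frames[sample.split_index:]
    if sample.direction == "forward":
        # The model observes the earlier segment and must predict what follows.
        return head, tail
    if sample.direction == "backward":
        # The model observes the later segment and must infer what came before.
        return tail, head
    raise ValueError(f"unknown direction: {sample.direction!r}")


example = SCPSample(
    frames=["f0", "f1", "f2", "f3"],
    split_index=2,
    direction="forward",
    question="Which object will be closer to the camera afterwards?",
    options=["A", "B"],
    answer="A",
)
shown, withheld = visible_and_hidden(example)
```

With `direction="forward"` the model would receive `["f0", "f1"]` and be asked about a state that only appears in `["f2", "f3"]`; switching to `"backward"` swaps the two segments.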

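Experimental Results

Research Questions

RQ1: How do current multimodal large models perform on SCP across diverse scenes and viewpoints?
RQ2: Which factors (perception vs. reasoning, temporal horizon, causal structure) most limit the SCP performance of existing models?
RQ3: Do model scaling and causal scaffolding improve SCP, and which strategies are most effective?
RQ4: Are multi-view and forward-prediction tasks more challenging than single-view and backward-reasoning tasks?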

Main Findings

Model	Avg.	Appearance Order	Counting	Planning	Relation	Relative Distance	Relative Size	Relative Speed	Spatial State
Human Performance	89.61	97.60	81.20	92.26	85.70	86.70	97.62	91.61	84.17
GPT-5 (Closed)	66.24	79.04	58.12	59.06	64.07	70.48	95.24	77.42	65.11
Gemini 2.5 Pro (Closed)	55.84	69.28	54.87	52.76	46.20	63.47	88.10	67.10	62.41
Gemini 2.5 Flash (Closed)	52.10	59.28	52.14	51.74	43.14	57.75	88.10	66.45	55.60
Claude Sonnet 4.5 (Closed)	56.14	68.86	52.14	57.43	45.65	60.90	80.95	68.39	63.90
Qwen3-VL-2B (Open)	43.04	41.92	42.74	45.01	40.85	44.41	59.52	47.10	40.65
Qwen3-VL-8B (Open)	47.52	54.49	51.28	49.29	42.33	49.47	90.48	46.45	46.40
Qwen3-VL-30B-A3B (Open)	54.16	65.27	52.14	54.79	46.22	56.65	85.71	66.45	57.19
Qwen3-VL-32B (Open)	56.84	59.88	51.28	58.66	52.63	57.98	90.48	67.10	55.04
Qwen3-VL-235B-A22B (Open)	61.04	67.07	54.70	60.90	55.03	63.03	97.62	74.84	63.31
Qwen3-Omni-30B-A3B (Open)	53.60	63.47	55.56	53.56	47.03	53.72	88.10	65.81	55.40
InternVL3.5-8B (Open)	50.52	59.88	54.70	54.79	43.82	54.52	61.90	58.71	44.96
InternVL3.5-38B (Open)	53.56	62.28	53.85	56.01	46.34	57.98	90.48	65.81	48.20
InternVL3.5-241B-A28B (Open)	56.96	67.07	60.68	61.10	46.11	60.37	90.48	68.39	60.07
MiniCPM-V-4.5 (Open)	43.80	53.29	49.57	43.99	36.04	49.20	76.19	52.26	42.81
DeepSeek-VL2 (Open)	38.08	45.51	38.46	39.51	29.41	45.74	73.81	53.55	33.81
NVILA-8B (Open)	34.40	36.53	36.75	38.09	30.66	30.05	59.52	38.71	37.05
NVILA-15B (Open)	45.28	54.49	45.30	48.07	35.35	52.13	73.81	50.97	49.28
LLaVA-OneVision-7B (Open)	36.48	42.51	37.61	37.07	31.24	38.30	64.29	46.45	35.61
LLaVA-OneVision-70B (Open)	50.84	64.67	52.99	48.68	44.39	53.46	78.57	61.94	51.80
LLaVA-OneVision-1.5-8B (Open)	45.52	56.29	47.01	46.44	39.13	50.27	80.95	51.61	41.73
LLaVA-NeXT-Video-7B (Open)	36.60	43.11	25.64	35.44	29.52	48.40	54.76	54.84	32.73
Spatial-MLLM (Spatial Model)	39.76	45.51	28.21	33.81	38.33	49.73	66.67	50.97	32.37
SpaceR (Spatial Model)	41.36	52.10	34.19	40.53	34.90	45.21	59.52	54.19	44.60
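
As a reading aid for the table, the short sketch below computes the per-category gap between the Human Performance row and the GPT-5 row. The numbers are copied from those two rows; the variable and function names are illustrative choices, and the comparison is a plain subtraction rather than any analysis taken from the paper.

```python
CATEGORIES = [
    "Appearance Order", "Counting", "Planning", "Relation",
    "Relative Distance", "Relative Size", "Relative Speed", "Spatial State",
]
# Per-category accuracy (%) copied from the table rows above.
HUMAN = [97.60, 81.20, 92.26, 85.70, 86.70, 97.62, 91.61, 84.17]
GPT5 = [79.04, 58.12, 59.06, 64.07, 70.48, 95.24, 77.42, 65.11]


def category_gaps(human, model):
    """Human-minus-model accuracy per category, rounded to two decimals."""
    return {c: round(h - m, 2) for c, h, m in zip(CATEGORIES, human, model)}


gaps = category_gaps(HUMAN, GPT5)
widest = max(gaps, key=gaps.get)      # category where GPT-5 trails humans most
narrowest = min(gaps, key=gaps.get)   # category where GPT-5 is closest to humans

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {gap:6.2f}")
```

On these two rows the widest gap is in Planning and the narrowest is in Relative Size.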

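Models remain far from human level on SCP-Bench (best average accuracy about 66.24% vs. a human average of 89.61%).
Large open-source models can match or exceed some closed models on certain SCP tasks, which points to gains from scaling and to the competitiveness of open models.
Relative size, relative speed, and spatial state are among the easier categories; object relations, reasoning, and counting are harder and call for higher-level reasoning.
Forward prediction remains more challenging than reasoning about the past; gains along the temporal extrapolation horizon are limited, with accuracy staying in the mid-to-upper 40% range.
Perception alone is not the bottleneck; reasoning about unobserved spatial states is the core limitation, and it stays difficult even when perception is improved (Gold Video).
Increasing model scale yields consistent gains; simple CoT/self-thinking gives limited or unstable improvement; perception enhancement brings marginal gains.
Causal scaffolds for the unobserved spatial state, especially textual descriptions of the future, improve performance markedly more than image or video scaffolds.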
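
The first two findings and the scaling trend can be checked directly against the Avg. column of the table. The sketch below does that arithmetic. The values are copied from the table; the dictionary and helper names are illustrative, and the Qwen3-VL models are taken in the order the table lists them.

```python
# Average accuracy (%) copied from the Avg. column of the results table.
AVG = {
    "Human Performance": 89.61,
    "GPT-5": 66.24,
    "Gemini 2.5 Pro": 55.84,
    "Gemini 2.5 Flash": 52.10,
    "Claude Sonnet 4.5": 56.14,
    "Qwen3-VL-2B": 43.04,
    "Qwen3-VL-8B": 47.52,
    "Qwen3-VL-30B-A3B": 54.16,
    "Qwen3-VL-32B": 56.84,
    "Qwen3-VL-235B-A22B": 61.04,
}
CLOSED = ["GPT-5", "Gemini 2.5 Pro", "Gemini 2.5 Flash", "Claude Sonnet 4.5"]
QWEN3_VL = [name for name in AVG if name.startswith("Qwen3-VL")]


def gap_to_human(model: str) -> float:
    """Percentage points by which a model trails the human average."""
    return round(AVG["Human Performance"] - AVG[model], 2)


# Closed models that the largest listed open model outscores on average.
best_open = "Qwen3-VL-235B-A22B"
beaten = [m for m in CLOSED if AVG[best_open] > AVG[m]]

# Does average accuracy rise at every step through the Qwen3-VL rows?
scores = [AVG[m] for m in QWEN3_VL]
rises_each_step = all(a < b for a, b in zip(scores, scores[1:]))

print(gap_to_human("GPT-5"), beaten, rises_each_step)
```

On these numbers GPT-5 trails the human average by 23.37 points, Qwen3-VL-235B-A22B outscores three of the four closed models on average, and accuracy increases at every step through the Qwen3-VL family.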

This digest was generated by AI and reviewed by a human editor.