[Paper Review] Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang | arXiv (Cornell University) | Mar 4, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

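This paper defines Spatial Causal Prediction (SCP) and builds SCP-Bench, a benchmark of 2,500 QA pairs across 1,181 videos, to evaluate spatial causal reasoning about unobserved past and future states. It also analyzes where models fall short and which strategies improve them.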

ABSTRACT

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

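Motivation and Goals

  • Propose a new task for spatial causal reasoning that goes beyond understanding visible spatio-temporal content.
  • Build and release SCP-Bench to systematically evaluate perception, reasoning, and prediction of spatial dynamics.
  • Benchmark 23 state-of-the-art models and expose the gap between humans and machines in spatial causal intelligence.
  • Analyze the factors that affect SCP performance and propose improvement strategies.
  • Provide insights on scaling, perception enhancement, and causal scaffolding for advancing SCP capability.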

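Proposed Method

  • Task formulation: Spatial Causal Prediction (SCP) is framed as a question-answering task with partial temporal context.
  • SCP-Bench is built by collecting diverse videos, annotating QA pairs semi-automatically, and verifying the split points that separate the visible and unseen parts of each video.
  • Eight spatial reasoning categories are defined across two causal directions (backward, forward) and two viewpoint settings (single-view, multi-view).
  • A broad set of models (proprietary, open-source, and spatially oriented) is evaluated across SCP tasks and scene types.
  • Controlled ablations separate perception from reasoning (Gold Video vs. captions) and test temporal robustness (single-frame vs. multi-frame input).
  • The analysis covers model scale, perception enhancement (dense captions, spatial interaction graphs), and external causal scaffolds (textual future descriptions, world models).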
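
The task formulation above can be made concrete with a small sketch. This is not the SCP-Bench data format: the class `SCPItem`, its field names, and the helper `visible_range` are hypothetical and introduced only for illustration. It also assumes that a forward question shows the model the frames before the split point and a backward question shows the frames after it, which is one plausible reading of "unseen past or future spatial states".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SCPItem:
    """One benchmark question. All field names are hypothetical."""
    video_id: str
    num_frames: int
    split_frame: int   # verified split point between visible and unseen parts
    direction: str     # "forward" (predict future) or "backward" (infer past)
    viewpoint: str     # "single" or "multi"
    category: str      # e.g. "Relative Distance"
    question: str
    answer: str


def visible_range(item: SCPItem) -> Tuple[int, int]:
    """Return the half-open frame range [start, end) shown to the model."""
    if not 0 < item.split_frame < item.num_frames:
        raise ValueError("split_frame must lie strictly inside the video")
    if item.direction == "forward":
        # Assumption: the model sees the past and must predict the future.
        return (0, item.split_frame)
    if item.direction == "backward":
        # Assumption: the model sees the later part and must infer the past.
        return (item.split_frame, item.num_frames)
    raise ValueError(f"unknown direction: {item.direction!r}")


# Usage with made-up values.
example = SCPItem(
    video_id="demo", num_frames=100, split_frame=40,
    direction="forward", viewpoint="single", category="Counting",
    question="How many cups will be on the table?", answer="3",
)
print(visible_range(example))  # (0, 40)
```

Keeping the split point on the item itself, rather than pre-cutting the video, would let the same clip serve both causal directions.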

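Experimental Results

Research Questions

  • RQ1: How do current multimodal large models perform on SCP across diverse scenes and viewpoints?
  • RQ2: Which factors (perception vs. reasoning, temporal horizon, causal structure) most limit the SCP performance of existing models?
  • RQ3: Do model scaling and causal scaffolding improve SCP, and which strategies are most effective?
  • RQ4: Are multi-view and forward-prediction tasks more challenging than single-view and backward-reasoning tasks?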

Key Findings

Accuracy (%) on SCP-Bench by category:

| Model | Avg. | Appearance Order | Counting | Planning | Relation | Relative Distance | Relative Size | Relative Speed | Spatial State |
|---|---|---|---|---|---|---|---|---|---|
| Human Performance | 89.61 | 97.60 | 81.20 | 92.26 | 85.70 | 86.70 | 97.62 | 91.61 | 84.17 |
| GPT-5 (Closed) | 66.24 | 79.04 | 58.12 | 59.06 | 64.07 | 70.48 | 95.24 | 77.42 | 65.11 |
| Gemini 2.5 Pro (Closed) | 55.84 | 69.28 | 54.87 | 52.76 | 46.20 | 63.47 | 88.10 | 67.10 | 62.41 |
| Gemini 2.5 Flash (Closed) | 52.10 | 59.28 | 52.14 | 51.74 | 43.14 | 57.75 | 88.10 | 66.45 | 55.60 |
| Claude Sonnet 4.5 (Closed) | 56.14 | 68.86 | 52.14 | 57.43 | 45.65 | 60.90 | 80.95 | 68.39 | 63.90 |
| Qwen3-VL-2B (Open) | 43.04 | 41.92 | 42.74 | 45.01 | 40.85 | 44.41 | 59.52 | 47.10 | 40.65 |
| Qwen3-VL-8B (Open) | 47.52 | 54.49 | 51.28 | 49.29 | 42.33 | 49.47 | 90.48 | 46.45 | 46.40 |
| Qwen3-VL-30B-A3B (Open) | 54.16 | 65.27 | 52.14 | 54.79 | 46.22 | 56.65 | 85.71 | 66.45 | 57.19 |
| Qwen3-VL-32B (Open) | 56.84 | 59.88 | 51.28 | 58.66 | 52.63 | 57.98 | 90.48 | 67.10 | 55.04 |
| Qwen3-VL-235B-A22B (Open) | 61.04 | 67.07 | 54.70 | 60.90 | 55.03 | 63.03 | 97.62 | 74.84 | 63.31 |
| Qwen3-Omni-30B-A3B (Open) | 53.60 | 63.47 | 55.56 | 53.56 | 47.03 | 53.72 | 88.10 | 65.81 | 55.40 |
| InternVL3.5-8B (Open) | 50.52 | 59.88 | 54.70 | 54.79 | 43.82 | 54.52 | 61.90 | 58.71 | 44.96 |
| InternVL3.5-38B (Open) | 53.56 | 62.28 | 53.85 | 56.01 | 46.34 | 57.98 | 90.48 | 65.81 | 48.20 |
| InternVL3.5-241B-A28B (Open) | 56.96 | 67.07 | 60.68 | 61.10 | 46.11 | 60.37 | 90.48 | 68.39 | 60.07 |
| MiniCPM-V-4.5 (Open) | 43.80 | 53.29 | 49.57 | 43.99 | 36.04 | 49.20 | 76.19 | 52.26 | 42.81 |
| DeepSeek-VL2 (Open) | 38.08 | 45.51 | 38.46 | 39.51 | 29.41 | 45.74 | 73.81 | 53.55 | 33.81 |
| NVILA-8B (Open) | 34.40 | 36.53 | 36.75 | 38.09 | 30.66 | 30.05 | 59.52 | 38.71 | 37.05 |
| NVILA-15B (Open) | 45.28 | 54.49 | 45.30 | 48.07 | 35.35 | 52.13 | 73.81 | 50.97 | 49.28 |
| LLaVA-OneVision-7B (Open) | 36.48 | 42.51 | 37.61 | 37.07 | 31.24 | 38.30 | 64.29 | 46.45 | 35.61 |
| LLaVA-OneVision-70B (Open) | 50.84 | 64.67 | 52.99 | 48.68 | 44.39 | 53.46 | 78.57 | 61.94 | 51.80 |
| LLaVA-OneVision-1.5-8B (Open) | 45.52 | 56.29 | 47.01 | 46.44 | 39.13 | 50.27 | 80.95 | 51.61 | 41.73 |
| LLaVA-NeXT-Video-7B (Open) | 36.60 | 43.11 | 25.64 | 35.44 | 29.52 | 48.40 | 54.76 | 54.84 | 32.73 |
| Spatial-MLLM (Spatial Model) | 39.76 | 45.51 | 28.21 | 33.81 | 38.33 | 49.73 | 66.67 | 50.97 | 32.37 |
| SpaceR (Spatial Model) | 41.36 | 52.10 | 34.19 | 40.53 | 34.90 | 45.21 | 59.52 | 54.19 | 44.60 |

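  • Models remain well below human level on SCP-Bench: the best model, GPT-5, reaches 66.24% average accuracy vs. 89.61% for humans.
  • Large open-source models match or exceed some closed models on certain SCP tasks, which points to the benefit of scaling and the competitiveness of open models.
  • Relative Size, Relative Speed, and Spatial State are the easier categories; object relations, reasoning, and counting are harder and call for higher-level reasoning.
  • Forward-looking prediction remains harder than reasoning about the past; gains across temporal extrapolation horizons are limited, with accuracy staying in the mid-to-upper 40% range.
  • Perception alone is not the bottleneck. Reasoning about unobserved spatial states is the core limitation: even when perception is improved (Gold Video), reasoning remains difficult.
  • Increasing model scale brings consistent gains; simple CoT or self-thinking gives limited or unstable improvement; perception enhancement yields only marginal gains.
  • Causal scaffolds for the unobserved spatial state, especially textual future descriptions, improve performance markedly more than image or video scaffolds.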
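
The human-model gap and the scaling trend can be checked directly against the "Avg." column of the table. The sketch below copies a subset of those values by hand; the dictionary `avg_acc` and the helper `gap_to_human` are illustrative names, not part of any released SCP-Bench tooling.

```python
# Average accuracy (%) copied from the "Avg." column of the results table.
avg_acc = {
    "Human Performance": 89.61,
    "GPT-5": 66.24,
    "Qwen3-VL-2B": 43.04,
    "Qwen3-VL-8B": 47.52,
    "Qwen3-VL-30B-A3B": 54.16,
    "Qwen3-VL-32B": 56.84,
    "Qwen3-VL-235B-A22B": 61.04,
}


def gap_to_human(model: str) -> float:
    """Percentage-point gap between human accuracy and a model's accuracy."""
    return round(avg_acc["Human Performance"] - avg_acc[model], 2)


# Qwen3-VL family ordered by total parameter count.
qwen_family = [
    "Qwen3-VL-2B",
    "Qwen3-VL-8B",
    "Qwen3-VL-30B-A3B",
    "Qwen3-VL-32B",
    "Qwen3-VL-235B-A22B",
]
scores = [avg_acc[name] for name in qwen_family]
# True when every larger model scores higher than the previous one.
monotonic = all(a < b for a, b in zip(scores, scores[1:]))

print(gap_to_human("GPT-5"))               # 23.37
print(gap_to_human("Qwen3-VL-235B-A22B"))  # 28.57
print(monotonic)                           # True
```

On these numbers the best model trails humans by about 23 percentage points, and average accuracy within the Qwen3-VL family rises with every step up in total parameter count.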


This review was generated by AI and reviewed by human editors.