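[Paper Digest] Spatial Causal Prediction in Video
This paper defines Spatial Causal Prediction (SCP) and builds SCP-Bench, which comprises 2,500 QA pairs across 1,181 videos, to evaluate spatial causal reasoning about unobserved past and future states. It also analyzes where models fall short and how they can be improved.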
Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.
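Motivation and Objectives
- Propose a new task for spatial causal reasoning that goes beyond visible spatio-temporal understanding.
- Build and release SCP-Bench to systematically evaluate perception, reasoning, and prediction of spatial dynamics.
- Benchmark 23 state-of-the-art models and reveal the gap between humans and machines in spatial causal intelligence.
- Analyze the factors that influence SCP performance and propose improvement strategies.
- Offer insights on scaling, perception enhancement, and causal scaffolding for advancing SCP capability.
Proposed Method
- Task formalization: SCP is framed as a question-answering task with partial temporal context.
- Build SCP-Bench by collecting diverse videos, annotating QA pairs semi-automatically, and verifying the split points that separate the visible and unseen portions of each video.
- Define 8 spatial reasoning categories across two causal directions (backward, forward) and two viewpoints (single-view, multi-view).
- Evaluate a broad set of models (proprietary, open-source, and spatially specialized) across SCP tasks and scene types.
- Run controlled ablations to separate perception from reasoning (Gold Video vs. captions) and test temporal robustness (single frame vs. multiple frames).
- Analyze the effects of model scale, perception enhancement (dense captions, spatial interaction graphs), and external causal scaffolds (textual future descriptions, world models).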
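To make the task format concrete, the following is a minimal sketch of how one SCP item and its split point might be represented. The `SCPSample` class, its field names, and `split_visible_hidden` are hypothetical illustrations, not the benchmark's actual schema. It also assumes that a forward question shows the frames before the split point and hides those after it, while a backward question does the reverse.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SCPSample:
    """One hypothetical SCP item; field names are illustrative only."""
    frames: Sequence[str]   # ordered frame identifiers for the full clip
    split_index: int        # frames[:split_index] come before the split point
    direction: str          # "forward" (predict future) or "backward" (infer past)
    question: str
    options: List[str]
    answer: str


def split_visible_hidden(sample: SCPSample) -> Tuple[Sequence[str], Sequence[str]]:
    """Return (visible, hidden) frames under the assumed direction semantics."""
    if not 0 < sample.split_index < len(sample.frames):
        raise ValueError("split_index must leave frames on both sides")
    before = sample.frames[:sample.split_index]
    after = sample.frames[sample.split_index:]
    if sample.direction == "forward":
        return before, after   # model sees the past, must predict the future
    if sample.direction == "backward":
        return after, before   # model sees the later part, must infer the past
    raise ValueError(f"unknown direction: {sample.direction!r}")


if __name__ == "__main__":
    demo = SCPSample(
        frames=["f0", "f1", "f2", "f3"],
        split_index=2,
        direction="forward",
        question="Which object will be closer to the camera?",
        options=["A", "B"],
        answer="A",
    )
    visible, hidden = split_visible_hidden(demo)
    print("visible:", list(visible), "hidden:", list(hidden))
```

Only the visible frames would be passed to the model together with the question; the hidden frames are what the answer depends on.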
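Experimental Results
Research Questions
- RQ1: How do current multimodal large models perform on SCP across diverse scenes and viewpoints?
- RQ2: Which factors (perception vs. reasoning, temporal horizon, causal structure) most limit the SCP performance of existing models?
- RQ3: Do model scaling and causal scaffolding improve SCP, and which strategies are most effective?
- RQ4: Are multi-view and forward-prediction tasks more challenging than single-view and backward-inference tasks?
Key Findings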
| Model | Avg. | Appearance Order | Counting | Planning | Relation | Relative Distance | Relative Size | Relative Speed | Spatial State |
|---|---|---|---|---|---|---|---|---|---|
| Human Performance | 89.61 | 97.60 | 81.20 | 92.26 | 85.70 | 86.70 | 97.62 | 91.61 | 84.17 |
| GPT-5 (Closed) | 66.24 | 79.04 | 58.12 | 59.06 | 64.07 | 70.48 | 95.24 | 77.42 | 65.11 |
| Gemini 2.5 Pro (Closed) | 55.84 | 69.28 | 54.87 | 52.76 | 46.20 | 63.47 | 88.10 | 67.10 | 62.41 |
| Gemini 2.5 Flash (Closed) | 52.10 | 59.28 | 52.14 | 51.74 | 43.14 | 57.75 | 88.10 | 66.45 | 55.60 |
| Claude Sonnet 4.5 (Closed) | 56.14 | 68.86 | 52.14 | 57.43 | 45.65 | 60.90 | 80.95 | 68.39 | 63.90 |
| Qwen3-VL-2B (Open) | 43.04 | 41.92 | 42.74 | 45.01 | 40.85 | 44.41 | 59.52 | 47.10 | 40.65 |
| Qwen3-VL-8B (Open) | 47.52 | 54.49 | 51.28 | 49.29 | 42.33 | 49.47 | 90.48 | 46.45 | 46.40 |
| Qwen3-VL-30B-A3B (Open) | 54.16 | 65.27 | 52.14 | 54.79 | 46.22 | 56.65 | 85.71 | 66.45 | 57.19 |
| Qwen3-VL-32B (Open) | 56.84 | 59.88 | 51.28 | 58.66 | 52.63 | 57.98 | 90.48 | 67.10 | 55.04 |
| Qwen3-VL-235B-A22B (Open) | 61.04 | 67.07 | 54.70 | 60.90 | 55.03 | 63.03 | 97.62 | 74.84 | 63.31 |
| Qwen3-Omni-30B-A3B (Open) | 53.60 | 63.47 | 55.56 | 53.56 | 47.03 | 53.72 | 88.10 | 65.81 | 55.40 |
| InternVL3.5-8B (Open) | 50.52 | 59.88 | 54.70 | 54.79 | 43.82 | 54.52 | 61.90 | 58.71 | 44.96 |
| InternVL3.5-38B (Open) | 53.56 | 62.28 | 53.85 | 56.01 | 46.34 | 57.98 | 90.48 | 65.81 | 48.20 |
| InternVL3.5-241B-A28B (Open) | 56.96 | 67.07 | 60.68 | 61.10 | 46.11 | 60.37 | 90.48 | 68.39 | 60.07 |
| MiniCPM-V-4.5 (Open) | 43.80 | 53.29 | 49.57 | 43.99 | 36.04 | 49.20 | 76.19 | 52.26 | 42.81 |
| DeepSeek-VL2 (Open) | 38.08 | 45.51 | 38.46 | 39.51 | 29.41 | 45.74 | 73.81 | 53.55 | 33.81 |
| NVILA-8B (Open) | 34.40 | 36.53 | 36.75 | 38.09 | 30.66 | 30.05 | 59.52 | 38.71 | 37.05 |
| NVILA-15B (Open) | 45.28 | 54.49 | 45.30 | 48.07 | 35.35 | 52.13 | 73.81 | 50.97 | 49.28 |
| LLaVA-OneVision-7B (Open) | 36.48 | 42.51 | 37.61 | 37.07 | 31.24 | 38.30 | 64.29 | 46.45 | 35.61 |
| LLaVA-OneVision-70B (Open) | 50.84 | 64.67 | 52.99 | 48.68 | 44.39 | 53.46 | 78.57 | 61.94 | 51.80 |
| LLaVA-OneVision-1.5-8B (Open) | 45.52 | 56.29 | 47.01 | 46.44 | 39.13 | 50.27 | 80.95 | 51.61 | 41.73 |
| LLaVA-NeXT-Video-7B (Open) | 36.60 | 43.11 | 25.64 | 35.44 | 29.52 | 48.40 | 54.76 | 54.84 | 32.73 |
| Spatial-MLLM (Spatial Model) | 39.76 | 45.51 | 28.21 | 33.81 | 38.33 | 49.73 | 66.67 | 50.97 | 32.37 |
| SpaceR (Spatial Model) | 41.36 | 52.10 | 34.19 | 40.53 | 34.90 | 45.21 | 59.52 | 54.19 | 44.60 |
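For readers reproducing numbers like those in the table, here is a minimal sketch of per-category accuracy with two ways of averaging. The function names and the `(category, is_correct)` input format are assumptions made for illustration, not the benchmark's evaluation code. The last lines take the GPT-5 row from the table and compute the unweighted mean of its eight category scores, which comes out near 71.1 rather than the reported 66.24. That would be consistent with the Avg. column for models being weighted by the number of questions per category, although the digest does not state this.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy (in percent) per category from (category, is_correct) pairs."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {c: 100.0 * correct[c] / total[c] for c in total}


def micro_average(results: Iterable[Tuple[str, bool]]) -> float:
    """Question-weighted accuracy: every question counts equally."""
    items = list(results)
    if not items:
        raise ValueError("no results to average")
    return 100.0 * sum(int(ok) for _, ok in items) / len(items)


def macro_average(per_category: Dict[str, float]) -> float:
    """Unweighted mean of category scores: every category counts equally."""
    if not per_category:
        raise ValueError("no categories to average")
    return sum(per_category.values()) / len(per_category)


# GPT-5 category scores copied from the table above.
GPT5_ROW = {
    "Appearance Order": 79.04, "Counting": 58.12, "Planning": 59.06,
    "Relation": 64.07, "Relative Distance": 70.48, "Relative Size": 95.24,
    "Relative Speed": 77.42, "Spatial State": 65.11,
}

if __name__ == "__main__":
    print(f"unweighted mean of GPT-5 categories: {macro_average(GPT5_ROW):.1f}")
```

The two averages diverge whenever categories have very different sizes, so it matters which one is compared against the Avg. column.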
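- Models remain far from human level on SCP-Bench: the best model, GPT-5, reaches 66.24% average accuracy versus 89.61% for humans.
- Large open-source models match or exceed some closed models on certain SCP tasks, showing both the benefit of scale and the competitiveness of open models.
- Relative Size, Relative Speed, and Spatial State are among the easier categories; object relations, reasoning, and counting are harder and require higher-level reasoning.
- Forward (future) prediction remains more challenging than backward (past) inference; temporal extrapolation is limited, with accuracy in the mid-to-upper 40% range.
- Perception alone is not the bottleneck; reasoning about unobserved spatial states is the core limitation, and it remains difficult even when perception is improved (Gold Video).
- Increasing model scale yields consistent gains; simple CoT or self-thinking helps little or inconsistently; perception enhancement brings marginal gains.
- Causal scaffolds for the unobserved state, especially textual descriptions of the future, improve performance significantly more than image or video scaffolds.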
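The scaling finding can be checked against the Avg. column of the table with a few lines of arithmetic. This is a minimal sketch: the grouping into families and the ordering of variants by the size in their model names are assumptions made here (mixture-of-experts variants such as 30B-A3B activate fewer parameters than their total), and `scale_gain` and `is_monotonic` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# (variant, Avg.) pairs copied from the table, ordered by the size in the model name.
FAMILY_AVG: Dict[str, List[Tuple[str, float]]] = {
    "Qwen3-VL": [("2B", 43.04), ("8B", 47.52), ("30B-A3B", 54.16),
                 ("32B", 56.84), ("235B-A22B", 61.04)],
    "InternVL3.5": [("8B", 50.52), ("38B", 53.56), ("241B-A28B", 56.96)],
    "NVILA": [("8B", 34.40), ("15B", 45.28)],
    "LLaVA-OneVision": [("7B", 36.48), ("70B", 50.84)],
}


def scale_gain(variants: List[Tuple[str, float]]) -> float:
    """Avg. difference (percentage points) between largest and smallest variant."""
    return round(variants[-1][1] - variants[0][1], 2)


def is_monotonic(variants: List[Tuple[str, float]]) -> bool:
    """True if Avg. strictly increases with each larger variant."""
    scores = [score for _, score in variants]
    return all(a < b for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    for family, variants in FAMILY_AVG.items():
        print(f"{family}: gain {scale_gain(variants):.2f} points, "
              f"monotonic={is_monotonic(variants)}")
```

Within each of these four families the average rises with every step up in size, and the gains between the smallest and largest variants range from 6.44 points (InternVL3.5) to 18.00 points (Qwen3-VL).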
This digest was generated by AI and reviewed by a human editor.