QUICK REVIEW

[Paper Digest] Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Cited by 0

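One-Sentence Summary

This paper defines Spatial Causal Prediction (SCP) and builds SCP-Bench, a benchmark of 2,500 QA pairs across 1,181 videos, to evaluate spatial causal reasoning about unobserved past and future states; it also analyzes where models fall short and which strategies improve them.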

ABSTRACT

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

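Motivation and Goals

Propose a new task for spatial causal reasoning that goes beyond visible spatio-temporal understanding.
Create and release SCP-Bench to systematically evaluate perception, reasoning, and prediction of spatial dynamics.
Benchmark 23 state-of-the-art models and expose the gap between humans and machines in spatial causal intelligence.
Analyze the factors that affect SCP performance and propose strategies for improvement.
Provide insights on scaling, perception enhancement, and causal scaffolding for advancing SCP capability.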

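Proposed Method

Formalize the task: Spatial Causal Prediction (SCP) is a question-answering task with partial temporal context.
Build SCP-Bench by collecting diverse videos, annotating QA pairs semi-automatically, and verifying the split points that separate the visible and unseen portions of each video.
Define 8 spatial reasoning categories across two causal directions (backward, forward) and two viewpoints (single-view, multi-view).
Evaluate a broad set of models (proprietary, open-source, and spatially oriented) across SCP tasks and scene types.
Run controlled ablations to separate perception from reasoning (Gold Video vs. captions) and to test temporal robustness (single frame vs. multiple frames).
Analyze the effects of model scale, perception enhancement (dense captions, spatial interaction graphs), and external causal scaffolds (textual future descriptions, world models).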
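
The task setup above (a question asked with only part of the video visible, separated by a split point) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual data format: the `SCPSample` class, its field names, and the `split_frames` helper are all hypothetical, and the sketch assumes that forward questions show the frames before the split while backward questions show the frames after it.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SCPSample:
    """Hypothetical record for one SCP question (field names are assumptions)."""
    video_id: str
    question: str
    options: Tuple[str, ...]
    answer_index: int
    split_index: int  # frame index separating the two segments
    direction: str    # "forward" or "backward"
    viewpoint: str    # "single" or "multi"


def split_frames(frames: Sequence, sample: SCPSample) -> Tuple[List, List]:
    """Return (visible, hidden) frames for one sample.

    Assumption: forward prediction observes the segment before the split
    and asks about the segment after it; backward prediction is the reverse.
    """
    if not 0 < sample.split_index < len(frames):
        raise ValueError("split_index must leave both segments non-empty")
    before = list(frames[:sample.split_index])
    after = list(frames[sample.split_index:])
    if sample.direction == "forward":
        return before, after
    if sample.direction == "backward":
        return after, before
    raise ValueError(f"unknown direction: {sample.direction!r}")
```

Only the visible segment would be passed to the model together with the question; the hidden segment corresponds to what the digest calls the Gold Video when it is revealed in the ablation.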

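Experimental Results

Research Questions

RQ1: How do current multimodal large models perform on SCP across diverse scenes and viewpoints?
RQ2: Which factors (perception vs. reasoning, temporal horizon, causal structure) most limit the SCP performance of existing models?
RQ3: Do model scaling and causal scaffolding improve SCP, and which strategies are most effective?
RQ4: Are multi-view and forward prediction tasks more challenging than single-view and backward reasoning tasks?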

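Key Findings

Models remain well below human level on SCP-Bench (best accuracy about 66.24% vs. a human average of 89.61%).
Large open-source models match or exceed some closed models on certain SCP tasks, showing both the benefit of scale and the competitiveness of open models.
Relative size, relative speed, and spatial state are the easier categories; object relations, reasoning, and counting are harder and require higher-level reasoning.
Compared with reasoning about the past, predicting the future remains challenging; gains across the temporal extrapolation horizon are limited, with accuracy staying in the mid-to-upper 40% range.
Perception alone is not the bottleneck; reasoning about unobserved spatial states is the core limitation, and reasoning remains difficult even when perception is improved (Gold Video).
Increasing model scale yields consistent gains; simple CoT/self-thinking helps little or inconsistently; perception enhancement brings only marginal gains.
Causal scaffolds that describe the unobserved spatial state, especially textual future descriptions, improve performance markedly more than image/video scaffolds.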
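
The headline comparison above reduces to simple accuracy arithmetic. The sketch below shows how per-group accuracy and the gap to human performance could be computed; the function names and the `(group, is_correct)` record layout are assumptions for illustration, and only the two accuracy figures come from the digest.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percent accuracy per group from (group, is_correct) pairs.

    The group label can be any breakdown of interest, e.g. causal
    direction, viewpoint, or reasoning category.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


def gap_to_human(model_acc: float, human_acc: float) -> float:
    """Gap in percentage points, rounded to two decimals."""
    return round(human_acc - model_acc, 2)


# Figures reported in the digest: best model 66.24%, human average 89.61%.
print(gap_to_human(66.24, 89.61))  # 23.37
```

With the reported figures, the best model trails the human average by 23.37 percentage points.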

This digest was generated by AI and reviewed by human editors.