QUICK REVIEW

[Paper Review] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Seunghwan Bang, Hwanjun Song|arXiv (Cornell University)|Mar 13, 2026

Multimodal Machine Learning Applications|Cited by 0

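One-Sentence Summary

The paper formalizes abstractive spatiotemporal reasoning over egocentric video, builds VAEX-Bench with 10 controllable indoor scenarios, and benchmarks 14 multimodal large language models to compare extractive and abstractive performance and identify potential bottlenecks.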

ABSTRACT

The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

Motivation and Objectives

Motivate the need for abstractive spatiotemporal reasoning beyond extractive cues in egocentric video understanding for embodied agents.
Propose a controllable dataset and taxonomy to evaluate long-horizon memory, global spatial reasoning, and scene reconstruction.
Systematically compare state-of-the-art MLLMs on extractive vs. abstractive tasks and identify bottlenecks.
Provide insights into how model performance degrades from extractive to abstractive tasks and across model families.

Proposed Method

Develop a taxonomy distinguishing extractive vs. abstractive video spatiotemporal reasoning.
Expand five representative extractive tasks into abstractive counterparts via a one-to-one expansion principle.
Create VAEX-Bench with 10 query-grounded egocentric indoor scenarios using a query-conditioned video construction pipeline.
Render controlled 3D environments in SketchUp/Enscape, with trajectory-based egocentric videos and ground-truth answers.
Evaluate 14 SOTA MLLMs under standardized zero-shot settings, including MCQ vs. free-form generation comparisons and diagnostic analyses.
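Because each abstractive task is paired one-to-one with an extractive counterpart, the degradation between settings can be reported per pair. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `Result` record, the pair label `"counting"`, and the toy data are all hypothetical and exist only to illustrate the computation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One model answer on one question (all fields are illustrative)."""
    task_pair: str  # hypothetical label shared by an extractive/abstractive pair
    setting: str    # "extractive" or "abstractive"
    correct: bool


def accuracy_by_setting(results):
    """Return {task_pair: {setting: accuracy}} for a list of Result records."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for r in results:
        cell = tally[r.task_pair][r.setting]
        cell[0] += int(r.correct)
        cell[1] += 1
    return {
        pair: {s: c / n for s, (c, n) in settings.items()}
        for pair, settings in tally.items()
    }


def extractive_to_abstractive_gap(results):
    """Accuracy drop (extractive minus abstractive) for each task pair."""
    acc = accuracy_by_setting(results)
    return {
        pair: by_setting["extractive"] - by_setting["abstractive"]
        for pair, by_setting in acc.items()
        if "extractive" in by_setting and "abstractive" in by_setting
    }


# Toy data, invented purely to exercise the functions above.
toy = [
    Result("counting", "extractive", True),
    Result("counting", "extractive", True),
    Result("counting", "abstractive", True),
    Result("counting", "abstractive", False),
]
gaps = extractive_to_abstractive_gap(toy)
print(gaps)
```

A positive gap for a pair means the model answers the extractive version more accurately than its abstractive expansion.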

Experimental Results

Research Questions

RQ1: Can MLLMs perform abstractive spatiotemporal reasoning by integrating dispersed observations into a global scene representation?
RQ2: How do extractive and abstractive performances differ across state-of-the-art MLLMs (proprietary vs. open-source) on controlled egocentric videos?
RQ3: What are the main bottlenecks (perceptual, temporal, spatial) limiting abstractive reasoning in video?
RQ4: Does MCQ-based evaluation inflate apparent capabilities compared to free-form generation for abstractive tasks?
RQ5: To what extent can models reconstruct global layouts and perform long-horizon aggregation across rooms and time?

Key Findings

Performance drops substantially when moving from extractive to abstractive reasoning across both open-source and proprietary models.
Accuracy is higher on MCQs than on free-form generation, indicating reliance on option cues.
Models show unstable object perception and counting across videos, with frequent miscounts and omissions.
Long-horizon temporal memory is weak when evidence is dispersed; maintaining global spatial layouts from egocentric observations is challenging.
Global Counting reveals persistent weaknesses in entity persistence and cross-room aggregation under partial observability.
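The MCQ versus free-form comparison can be made concrete by scoring the same questions in both formats and taking the difference. The sketch below assumes normalized exact match for free-form answers; this summary does not describe the paper's actual scoring protocol, so that choice, the dictionary keys, and the toy items are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score_mcq(predicted_letter, gold_letter):
    """An MCQ answer is correct when the chosen option letter matches."""
    return predicted_letter.strip().upper() == gold_letter.strip().upper()


def score_free_form(prediction, gold_answer):
    """Assumed scoring rule: exact match after normalization."""
    return normalize(prediction) == normalize(gold_answer)


def format_gap(items):
    """Return (mcq_accuracy, free_form_accuracy, gap) over paired items."""
    if not items:
        raise ValueError("items must be non-empty")
    mcq = sum(score_mcq(i["mcq_pred"], i["mcq_gold"]) for i in items) / len(items)
    ff = sum(score_free_form(i["ff_pred"], i["ff_gold"]) for i in items) / len(items)
    return mcq, ff, mcq - ff


# Toy items, invented for illustration only.
toy_items = [
    {"mcq_pred": "b", "mcq_gold": "B", "ff_pred": "Three chairs.", "ff_gold": "three chairs"},
    {"mcq_pred": "A", "mcq_gold": "A", "ff_pred": "two chairs", "ff_gold": "four chairs"},
]
mcq_acc, ff_acc, gap = format_gap(toy_items)
print(mcq_acc, ff_acc, gap)
```

A positive gap on the same questions is consistent with the model relying on option cues. Exact match is strict, so a real evaluation might instead use a numeric tolerance or a judge model for free-form answers.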


This review was generated by AI and checked by a human editor.