QUICK REVIEW

[Paper Review] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Seunghwan Bang, Hwanjun Song | arXiv (Cornell University) | Mar 13, 2026

Multimodal Machine Learning Applications | Citations: 0

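One-Line Summary

Key point: This paper formalizes abstractive spatiotemporal reasoning over egocentric video, builds VAEX-Bench with 10 controllable indoor scenarios, and benchmarks 14 MLLMs to compare extractive versus abstractive performance and the underlying bottlenecks.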

ABSTRACT

The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

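Motivation and Objectives

Motivate the need for abstractive spatiotemporal reasoning over egocentric video, going beyond extractive cues, as a basis for embodied agents' understanding.
Propose a controllable dataset and a taxonomy for evaluating long-term memory, global spatial reasoning, and scene reconstruction.
Systematically compare state-of-the-art MLLMs on extractive and abstractive tasks and identify the bottlenecks.
Provide insights into the performance drop from extractive to abstractive tasks and across model families.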

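Proposed Method

Develop a taxonomy that distinguishes extractive from abstractive spatiotemporal reasoning over video.
Expand five representative extractive tasks into abstractive counterparts using a one-to-one expansion principle.
Create VAEX-Bench, with 10 query-grounded egocentric indoor scenarios, using a query-conditioned video construction pipeline.
Render controlled 3D environments in SketchUp/Enscape and provide trajectory-based egocentric videos with ground-truth answers.
Evaluate 14 state-of-the-art MLLMs under a standardized zero-shot setting, including a comparison of MCQ versus free-form generation, and carry out a diagnostic analysis.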
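
The MCQ versus free-form comparison in the last step can be illustrated with a small scoring sketch. The review does not describe how the benchmark actually scores answers, so everything here is an assumption: MCQ predictions are matched on the option letter, free-form predictions on a normalized string, and the names `normalize`, `score_mcq`, `score_free_form`, and `accuracy` are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    This normalization is an assumption, not the benchmark's rule.
    """
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def score_mcq(predicted_option: str, correct_option: str) -> bool:
    """Score a multiple-choice answer by comparing option letters."""
    return predicted_option.strip().upper() == correct_option.strip().upper()


def score_free_form(predicted: str, reference: str) -> bool:
    """Score a free-form answer by exact match after normalization."""
    return normalize(predicted) == normalize(reference)


def accuracy(flags) -> float:
    """Fraction of correct answers; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Exact matching after normalization is a deliberately strict choice for free-form answers. A real evaluation might instead use a judge model or a task-specific metric, which this sketch does not attempt to reproduce.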

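Experimental Results

Research Questions

RQ1: Can MLLMs perform abstractive spatiotemporal reasoning by integrating dispersed observations into a global scene representation?
RQ2: On controlled egocentric videos, how does extractive versus abstractive performance differ across state-of-the-art MLLMs (commercial and open-source)?
RQ3: What are the main bottlenecks (perceptual, temporal, spatial) that limit abstractive reasoning over video?
RQ4: Does MCQ-based evaluation overestimate capability on abstractive tasks compared with free-form generation?
RQ5: To what extent can models reconstruct the global layout and perform long-horizon aggregation across rooms and time?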

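Key Findings

Moving from extractive to abstractive reasoning causes a substantial performance drop for both open-source and commercial models.
Accuracy is higher under MCQ than under free-form generation, which indicates that models rely on cues in the answer options.
Models are unstable at recognizing and counting objects across the whole video, often miscounting or missing objects.
Long-term temporal memory is weak when evidence is dispersed, and maintaining a global spatial layout from egocentric observations is difficult.
Global Counting consistently exposes weaknesses in entity persistence under partial observation and in cross-room aggregation.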
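
The extractive-to-abstractive drop in the first finding could be tabulated per task pair roughly as follows. This is a minimal sketch with a hypothetical helper `accuracy_drop`. The task keys and accuracy values are placeholders chosen for illustration and are not results from the paper.

```python
from statistics import mean


def accuracy_drop(extractive: dict[str, float],
                  abstractive: dict[str, float]) -> dict[str, float]:
    """Per-task drop (extractive minus abstractive) for tasks present in both."""
    return {
        task: extractive[task] - abstractive[task]
        for task in extractive
        if task in abstractive
    }


# Placeholder accuracies for one hypothetical model; NOT results from the paper.
extractive_acc = {"task_a": 0.75, "task_b": 0.5}
abstractive_acc = {"task_a": 0.5, "task_b": 0.25}

drops = accuracy_drop(extractive_acc, abstractive_acc)
mean_drop = mean(drops.values())
```

Pairing each extractive task with its abstractive counterpart follows the one-to-one design described in the review, so the drop is computed per pair before averaging.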

This review was generated by AI and checked by a human editor.