QUICK REVIEW

[Paper Review] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Seunghwan Bang, Hwanjun Song | arXiv (Cornell University) | 2026-03-13

Multimodal Machine Learning Applications | Citations: 0

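One-line summary

This paper formalizes abstractive spatiotemporal reasoning over egocentric video, builds VAEX-Bench from 10 controllable indoor scenarios, and benchmarks 14 MLLMs to compare extractive and abstractive performance and to identify the underlying bottlenecks.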

ABSTRACT

The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

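Motivation and objectives

Argues that egocentric video understanding for embodied agents requires abstractive spatiotemporal reasoning that goes beyond extractive cues.
Proposes a controllable dataset and a taxonomy for evaluating long-term memory, global spatial reasoning, and scene reconstruction.
Systematically compares state-of-the-art MLLMs on extractive versus abstractive tasks and identifies bottlenecks.
Provides insight into how model performance degrades from extractive to abstractive tasks and across model families.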

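Proposed method

Develops a taxonomy that distinguishes extractive from abstractive spatiotemporal video reasoning.
Extends five representative extractive tasks into abstractive counterparts through a one-to-one extension principle.
Builds VAEX-Bench from 10 query-grounded egocentric indoor scenarios using a query-conditioned video construction pipeline.
Renders controlled 3D environments in SketchUp/Enscape and provides path-based egocentric videos with ground-truth answers.
Evaluates 14 SOTA MLLMs in a standardized zero-shot setting, including a comparison of MCQ and free-form answering, and performs a diagnostic analysis.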
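
The paired design described above, where each extractive task has an abstractive counterpart scored in the same multiple-choice setting, can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the `Item` record, its field names, and the task name used in the usage note are hypothetical, and the sketch assumes each prediction has already been reduced to a single option letter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One scored question. All field names are hypothetical."""
    task: str     # task name, e.g. "counting"
    setting: str  # "extractive" or "abstractive"
    gold: str     # correct option letter
    pred: str     # option letter chosen by the model


def accuracy_by_setting(items):
    """Return accuracy for every (task, setting) pair present in items."""
    totals, hits = {}, {}
    for it in items:
        key = (it.task, it.setting)
        totals[key] = totals.get(key, 0) + 1
        correct = it.pred.strip().upper() == it.gold.strip().upper()
        hits[key] = hits.get(key, 0) + int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def extractive_abstractive_gap(items):
    """Return extractive accuracy minus abstractive accuracy per task.

    Tasks missing either setting are skipped rather than guessed.
    """
    acc = accuracy_by_setting(items)
    gaps = {}
    for task in {task for task, _ in acc}:
        ext = acc.get((task, "extractive"))
        abs_ = acc.get((task, "abstractive"))
        if ext is not None and abs_ is not None:
            gaps[task] = ext - abs_
    return gaps
```

A positive gap for a task means the model scored higher on the extractive version than on its abstractive counterpart. Reporting the gap per task, rather than one pooled number, keeps the one-to-one pairing visible.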

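Experimental results

Research questions

RQ1: Can MLLMs perform abstractive spatiotemporal reasoning by integrating dispersed observations into a global scene representation?
RQ2: How large is the gap between extractive and abstractive performance for state-of-the-art MLLMs (proprietary vs. open-source) on controlled egocentric videos?
RQ3: What are the main bottlenecks (perceptual, temporal, spatial) that limit abstractive reasoning?
RQ4: Does MCQ-based evaluation overestimate capability on abstractive tasks relative to free-form generation?
RQ5: To what extent can models reconstruct the global layout and aggregate information over long horizons, across rooms and over time?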

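Key findings

Performance drops substantially when moving from extractive to abstractive reasoning, for both open-source and proprietary models.
Accuracy is higher on MCQ than on free-form generation, which suggests that models rely on cues in the answer options.
Models are unstable at object recognition and counting across videos, with frequent false positives and omissions.
Long-term memory is weak when evidence is dispersed, and models struggle to maintain the global spatial layout from egocentric observations.
Global Counting exposes a persistent weakness in entity persistence and cross-room aggregation under partial observation.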
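
To make the Global Counting finding concrete, the toy sketch below shows why entity persistence matters when a camera path revisits a room. It is only an illustration under stated assumptions, not the benchmark's pipeline: the observation tuple layout and the entity identifiers are hypothetical, and it assumes the rendering side can label every sighting with a stable entity ID.

```python
def naive_count(observations, category):
    """Count every sighting of a category, ignoring identity.

    observations: iterable of (timestamp, room, entity_id, category)
    tuples. This layout is a hypothetical example format.
    """
    return sum(1 for _, _, _, cat in observations if cat == category)


def global_count(observations, category):
    """Count distinct entities of a category across all rooms.

    A sighting is keyed by (room, entity_id), so an object seen again
    on a later pass through the same room is counted only once.
    """
    seen = set()
    for _, room, entity_id, cat in observations:
        if cat == category:
            seen.add((room, entity_id))
    return len(seen)


# The path enters the kitchen, visits the bedroom, then returns to the kitchen.
observations = [
    (0, "kitchen", "chair_1", "chair"),
    (1, "kitchen", "chair_2", "chair"),
    (5, "bedroom", "chair_3", "chair"),
    (9, "kitchen", "chair_1", "chair"),  # revisit: same chair as at t=0
    (10, "kitchen", "table_1", "table"),
]
```

Here `naive_count(observations, "chair")` counts four sightings, while `global_count(observations, "chair")` finds three distinct chairs. A model that cannot track which sightings refer to the same object is prone to the first kind of answer.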


This review was generated by AI and reviewed by a human editor.