QUICK REVIEW

[Paper Digest] STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu, Shoubin Yu | arXiv (Cornell University) | May 15, 2024

Multimodal Machine Learning Applications | References: 36 | Cited by: 22

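One-Sentence Summary

STAR introduces a real-world video benchmark for situated reasoning that combines hypergraph-based situation abstraction, logic-grounded questions, and a diagnostic neuro-symbolic model.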

ABSTRACT

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon the real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark.

Motivation and Objectives

Motivate evaluation of real-world situated reasoning that links perception, abstraction, and logic.
Define a controlled benchmark with action-centered situations and structured hypergraph representations.
Provide procedurally generated questions with executable reasoning programs grounded in situations.
Offer a diagnostic neuro-symbolic model to dissect perception, abstraction, and reasoning components.

Proposed Method

Represent situations as hypergraphs connecting entities, relations, and actions.
Generate questions and options from question templates and executable functional programs.
Use a video parser to extract entities, poses, and relationships from real videos.
Propose a Dynamics Transformer to model transitions over situation hypergraphs.
Implement a Program Executor that runs functional programs on hypergraphs to produce answers.
Evaluate state-of-the-art baselines and analyze gaps with the diagnostic Neuro-Symbolic Situated Reasoning (NS-SR) model.
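The hypergraph representation and the Program Executor described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `ActionEdge` class, the operation names (`filter_verb`, `filter_obj`, `after`, `first`, `query`), and the toy situation are all hypothetical, chosen only to show how a functional program might be run step by step over an action-centered situation graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionEdge:
    """Hypothetical action hyperedge: one action spanning several relation triplets."""
    verb: str
    obj: str
    start: int            # first frame of the action
    end: int              # last frame of the action
    relations: tuple = () # (subject, relation, object) triplets tied to this action


# Each operation receives every edge in the situation plus the current state.
def op_filter_verb(all_edges, state, verb):
    return [e for e in state if e.verb == verb]


def op_filter_obj(all_edges, state, obj):
    return [e for e in state if e.obj == obj]


def op_after(all_edges, state, _):
    # Actions that start once the earliest selected action has ended.
    if not state:
        return []
    anchor = min(state, key=lambda e: e.start)
    return [e for e in all_edges if e.start >= anchor.end]


def op_first(all_edges, state, _):
    return state[:1]


def op_query(all_edges, state, attr):
    return [getattr(e, attr) for e in state]


OPS = {
    "filter_verb": op_filter_verb,
    "filter_obj": op_filter_obj,
    "after": op_after,
    "first": op_first,
    "query": op_query,
}


def execute(program, edges):
    """Run a list of (operation, argument) steps over the situation's edges."""
    all_edges = sorted(edges, key=lambda e: e.start)
    state = list(all_edges)
    for op_name, arg in program:
        state = OPS[op_name](all_edges, state, arg)
    return state


# Toy situation (made up for illustration, not taken from the dataset).
edges = [
    ActionEdge("take", "cup", 0, 20, (("person", "holding", "cup"),)),
    ActionEdge("drink_from", "cup", 25, 60, (("person", "holding", "cup"),)),
    ActionEdge("put_down", "cup", 65, 80),
    ActionEdge("open", "laptop", 90, 120, (("person", "touching", "laptop"),)),
]

# Interaction-style question: "Which object did the person take?"
interaction_program = [("filter_verb", "take"), ("query", "obj")]

# Sequence-style question: "What did the person do after taking the cup?"
sequence_program = [
    ("filter_verb", "take"),
    ("filter_obj", "cup"),
    ("after", None),
    ("first", None),
    ("query", "verb"),
]

print(execute(interaction_program, edges))  # ['cup']
print(execute(sequence_program, edges))     # ['drink_from']
```

Because every step is an explicit operation over the graph, a wrong answer can be traced to a missing or mislabeled edge, which is the kind of diagnosis the digest attributes to the neuro-symbolic design.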

Experimental Results

Research Questions

RQ1: What capabilities are required for effective reasoning in real-world, dynamic situations (perception, abstraction, and symbolic reasoning)?
RQ2: How well do existing visual QA and video reasoning models handle interaction, sequence, prediction, and feasibility questions grounded in real-world situations?
RQ3: Can a neuro-symbolic architecture disentangle perception, abstraction, language understanding, and symbolic reasoning to diagnose STAR’s challenges?

Key Findings

Current baselines struggle on STAR’s situated reasoning tasks, with large performance gaps across question types.
Vision-language and video QA models improve modestly over random or frequent-answer baselines but still underperform on prediction and feasibility questions.
An NS-SR variant given oracle (ground-truth) inputs throughout achieves perfect scores, highlighting the importance of accurate situation hypergraphs and programs for reasoning.
Visual perception and situation abstraction are the primary bottlenecks, with language understanding contributing less to errors.
The proposed NS-SR architecture provides insights into where perception, abstraction, and symbolic reasoning can break down in real-world videos.
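Gaps across question types, as described in the findings above, are typically surfaced by scoring each type separately. The sketch below shows one way to do that. The record format, the function name `accuracy_by_type`, and the use of an unweighted mean over types are assumptions made for illustration, and the sample records are placeholders, not results from the paper.

```python
from collections import defaultdict

# The four question types named in the abstract.
QUESTION_TYPES = ("interaction", "sequence", "prediction", "feasibility")


def accuracy_by_type(records):
    """Compute accuracy per question type and their unweighted mean.

    records: iterable of (question_type, predicted_answer, gold_answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == gold)
    per_type = {t: correct[t] / total[t] for t in QUESTION_TYPES if total[t]}
    mean = sum(per_type.values()) / len(per_type) if per_type else 0.0
    return per_type, mean


# Placeholder predictions, invented purely to exercise the function.
records = [
    ("interaction", "A", "A"),
    ("interaction", "B", "C"),
    ("sequence", "D", "D"),
    ("prediction", "A", "B"),
    ("feasibility", "C", "C"),
    ("feasibility", "B", "A"),
]

per_type, mean = accuracy_by_type(records)
print(per_type)
print(mean)
```

Reporting the per-type numbers next to the mean keeps a strong score on one question type from hiding weak performance on another.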

This digest was generated by AI and reviewed by human editors.