QUICK REVIEW

[Paper Review] STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu, Shoubin Yu|arXiv (Cornell University)|May 15, 2024

Multimodal Machine Learning Applications|References: 36|Citations: 22

One-Line Summary

STAR introduces a real-world video benchmark that abstracts situations as hypergraphs and pairs logic-grounded questions with a diagnostic neuro-symbolic model.

ABSTRACT

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon the real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark.

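Motivation and Objectives

Motivate the evaluation of situated reasoning in the real world by linking perception, abstraction, and logic.
Define a controlled benchmark built on action-centric situations with structured hypergraph representations.
Provide procedurally generated questions, each with an executable reasoning program grounded in the situation.
Provide a diagnostic neuro-symbolic model that separates the perception, abstraction, and reasoning components.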

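Proposed Method

Represent each situation as a hypergraph connecting entities, relations, and actions.
Generate questions and answer options from question templates and executable functional programs.
Use a video parser to extract entities, poses, and relations from real-world videos.
Propose a Dynamics Transformer that models transitions over situation hypergraphs.
Implement a program executor that runs functional programs over the hypergraph to produce answers.
Evaluate state-of-the-art baselines and analyze the gap between them and the NS-SR diagnostic model.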
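
The hypergraph, template, and executor steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hyperedge` fields, the operation names (`after`, `filter_action`, `query_object`), the template wording, and the toy situation are all assumptions made for illustration. Each hyperedge ties a person, an action, and an object to a time span, and a functional program is a list of operations applied in order.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical action hyperedge: links a person, an action, and an object."""
    action: str
    person: str
    obj: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class SituationHypergraph:
    edges: list[Hyperedge] = field(default_factory=list)


def after(edges, action, obj):
    """Keep edges that start once the first matching anchor action has ended."""
    anchors = [e for e in edges if e.action == action and e.obj == obj]
    if not anchors:
        return []
    return [e for e in edges if e.start >= anchors[0].end]


def filter_action(edges, action):
    """Keep edges whose action matches."""
    return [e for e in edges if e.action == action]


def query_object(edges):
    """Return the object of the earliest remaining edge, or None."""
    return edges[0].obj if edges else None


# Illustrative operation table (names are assumptions, not the paper's).
OPS = {"after": after, "filter_action": filter_action, "query_object": query_object}


def run_program(program, graph):
    """Execute a functional program step by step over the hypergraph."""
    state = sorted(graph.edges, key=lambda e: e.start)
    for name, *args in program:
        state = OPS[name](state, *args)
    return state


# Toy situation (invented data).
graph = SituationHypergraph([
    Hyperedge("take", "person", "phone", 0.0, 1.0),
    Hyperedge("put_down", "person", "cup", 2.0, 3.0),
    Hyperedge("take", "person", "book", 4.0, 6.0),
    Hyperedge("open", "person", "door", 7.0, 9.0),
])

# A sequence-type question built from an illustrative template.
TEMPLATE = "Which object did the person {verb} after they {anchor_verb} the {anchor_obj}?"
question = TEMPLATE.format(verb="take", anchor_verb="put down", anchor_obj="cup")

# The answering logic for that question as a functional program.
program = [("after", "put_down", "cup"), ("filter_action", "take"), ("query_object",)]
answer = run_program(program, graph)
print(question, "->", answer)
```

Because the answer is computed entirely from the hypergraph, an error in any extracted hyperedge (for example, a mislabeled action) propagates directly to the answer, which is why a model that keeps these stages separate is useful for diagnosis.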

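Experimental Results

Research Questions

RQ1: What capabilities (perception, abstraction, and symbolic reasoning) are needed for effective reasoning in dynamic real-world situations?
RQ2: How well do existing visual QA and video reasoning models handle situated questions about interaction, sequence, prediction, and feasibility in real-world videos?
RQ3: Can a neuro-symbolic architecture diagnose the challenges of STAR by separating perception, abstraction, language understanding, and symbolic reasoning?

Key Findings

Current baselines struggle on STAR's situated reasoning task, with large performance gaps across question types.
Vision-language and video QA models improve only slightly over the random and most-frequent-answer baselines, and still underperform on prediction and feasibility questions.
The fully oracle NS-SR variant achieves a perfect score, underscoring how much reasoning depends on accurate situation hypergraphs and programs.
Visual perception and situation abstraction are the main bottlenecks; language understanding contributes less to the errors.
The proposed NS-SR architecture offers insight into where perception, abstraction, and symbolic reasoning can break down in real-world videos.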

This review was generated by AI and checked by a human editor.