QUICK REVIEW

[Paper Review] EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

Karttikeya Mangalam, Raiymbek Akshulakov|arXiv (Cornell University)|Aug 17, 2023

Multimodal Machine Learning Applications|Cited by 19

One-line summary

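EgoSchema is a very long-form video QA benchmark derived from Ego4D. It comprises over 5,000 multiple-choice questions spanning over 250 hours of egocentric video, and it uses temporal certificate sets to analyze intrinsic temporal difficulty.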

ABSTRACT

We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multi-choice question answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and Zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io

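Motivation and objectives

Introduce EgoSchema as a diagnostic benchmark for very long-form video-language understanding.
Define the temporal certificate length to capture intrinsic temporal difficulty beyond raw clip length.
Show that current state-of-the-art models perform far below humans on very long-form QA tasks.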

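Proposed method

Filter three-minute Ego4D clips that have dense narrations, then generate QA triples (QAW) by LLM prompting.
Use multiple prompting strategies (QAW-shot, Q(AW)-shot) to write questions and confusable answer options that focus on long-term reasoning.
Apply rule-based and LLM-based filtering to discard low-quality or poorly grounded questions, and enforce a minimum certificate length of 30 seconds through two rounds of manual curation.
Define the temporal certificate as the minimal subclips needed to verify the correct answer, and compute the certificate length for each clip.
Benchmark the zero-shot QA performance of several video-language models, and of humans, on EgoSchema.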
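
The certificate-length step can be illustrated with a minimal sketch. It assumes each clip's certificate is given as a list of annotated `(start_sec, end_sec)` subclips and that the certificate length is the total duration they cover after overlapping subclips are merged. The function names and the merging rule are illustrative assumptions, not the authors' released code.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical representation: one annotated subclip as (start_sec, end_sec).
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching subclips so no second is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if end <= start:
            raise ValueError(f"empty or inverted subclip: ({start}, {end})")
        if merged and start <= merged[-1][1]:
            # Overlaps the previous subclip: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def certificate_length(intervals: List[Interval]) -> float:
    """Total seconds covered by the certificate subclips (assumed definition)."""
    return sum(end - start for start, end in merge_intervals(intervals))


def median_certificate_length(per_clip: List[List[Interval]]) -> float:
    """Median certificate length over a sample of clips."""
    return median(certificate_length(clip) for clip in per_clip)
```

For example, a certificate with subclips `(0, 30)`, `(20, 50)`, and `(100, 130)` covers 80 seconds under this definition, because the first two subclips overlap by 10 seconds.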

Figure 1: The EgoSchema dataset contains over 5000 very long-form video language understanding questions spanning over 250 hours of real, diverse, and high-quality egocentric video data. Each question requires choosing the correct answer out of five choices based on a three minute long video clip.

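Experimental results

Research questions

RQ1: How large is the intrinsic temporal difficulty of EgoSchema, as measured by temporal certificate length?
RQ2: How well do current video-language models perform zero-shot on EgoSchema's long-form QA task?
RQ3: How does human performance compare with model performance under different time and frame constraints?
RQ4: How much room do current models have to improve relative to humans in long-form video understanding?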

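Key findings

The median temporal certificate length of EgoSchema is about 100 seconds, over 5.7x longer than the second-longest dataset and roughly 10x to 100x longer than other video understanding datasets.
Zero-shot QA accuracy on EgoSchema stays below 33% even for models with billions of parameters, while humans reach about 76% in the unconstrained setting.
Some of the evaluated models improve in accuracy as the number of frames increases, but they saturate at around 30 frames and remain far below human performance.
Humans still reach about 67% accuracy on 1 fps video and about 76% in the video-to-text setting, so the gap to models is substantial.
EgoSchema is released under the Ego4D license to enable both research and commercial use.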
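
To put the reported accuracies in context, the sketch below shows how multiple-choice accuracy and the random-guess baseline would typically be computed for a five-option task. The function names and the integer-index encoding of answers are assumptions for illustration and are not taken from the EgoSchema evaluation code.

```python
from typing import Sequence


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index equals the answer."""
    if len(predictions) != len(answers) or len(answers) == 0:
        raise ValueError("predictions and answers must be non-empty and equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Toy data only: option indices 0..4 for four hypothetical questions.
    preds = [0, 1, 2, 3]
    gold = [0, 1, 0, 0]
    print(f"accuracy = {accuracy(preds, gold):.2f}")
    print(f"chance   = {chance_accuracy(5):.2f}")
```

With five options the chance level is 20%, so a model scoring just under 33% is only about 13 points above random guessing, while the reported human accuracy of about 76% is roughly 56 points above it.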

Figure 2: We introduce the notion of a temporal certificate set (top, § 3.2), a tool to measure the intrinsic temporal length of a benchmark, and show the EgoSchema certificate length distribution (bottom, § 4.1) for 100 randomly chosen clips.


This review was generated by AI and checked by a human editor.