QUICK REVIEW

[Paper Review] The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary

Bernard Ghanem, Juan Carlos Niebles|arXiv (Cornell University)|Aug 11, 2018

Human Pose and Action Recognition | References: 5 | Citations: 55

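One-Line Summary

This paper summarizes the 2018 ActivityNet Challenge, describing its six tasks (three core ActivityNet tasks and three guest tasks) and detailing the top-performing submissions for temporal proposals, localization, and dense captioning in large-scale video.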

ABSTRACT

The 3rd annual installment of the ActivityNet Large-Scale Activity Recognition Challenge, held as a full-day workshop in CVPR 2018, focused on the recognition of daily life, high-level, goal-oriented activities from user-generated videos as those found in internet video portals. The 2018 challenge hosted six diverse tasks which aimed to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the six tasks were based on the ActivityNet dataset, which was introduced in CVPR 2015 and organized hierarchically in a semantic taxonomy. These tasks focused on tracing evidence of activities in time in the form of proposals, class labels, and captions. In this installment of the challenge, we hosted three guest tasks to enrich the understanding of visual information in videos. The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.

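Research Motivation and Objectives

Push the limits of semantic visual understanding of daily-life activities in large-scale, user-generated video.
Bridge visual content with human captions through a diverse set of tasks and datasets.
Provide standardized evaluation across ActivityNet and the guest datasets, using metrics for proposals, localization, and captioning.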

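Proposed Method

Define six tasks (three ActivityNet-based, three guest tasks) to evaluate different aspects of video understanding.
Evaluate temporal proposals with the Average Recall vs. Average Number of proposals per video (AR-AN) curve, and summarize proposal quality with AR/AN-based metrics.
Evaluate temporal localization with Mean Average Precision (mAP) across temporal Intersection over Union (tIoU) thresholds.
Evaluate dense event captioning with average METEOR/BLEU/CIDEr-based metrics.
Incorporate the Kinetics-600, AVA, and Moments in Time guest tasks to broaden large-scale understanding.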
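
The proposal and localization metrics above both rest on tIoU between a predicted segment and a ground-truth segment. The following is a minimal Python sketch of tIoU and of recall averaged over a set of tIoU thresholds. The function names, the `(start, end)` segment representation, and the threshold values are illustrative assumptions, not the challenge's official evaluation code.

```python
from typing import List, Tuple

# Assumption: a segment is (start_sec, end_sec) with start <= end.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal Intersection over Union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(ground_truth: List[Segment],
                   proposals: List[Segment],
                   threshold: float) -> float:
    """Fraction of ground-truth segments matched by at least one
    proposal whose tIoU reaches the threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for gt in ground_truth
        if any(temporal_iou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


def average_recall(ground_truth: List[Segment],
                   proposals: List[Segment],
                   thresholds: List[float]) -> float:
    """Recall averaged over several tIoU thresholds (hypothetical helper)."""
    return sum(
        recall_at_tiou(ground_truth, proposals, t) for t in thresholds
    ) / len(thresholds)


# Toy data: two annotated activities, two proposals for one video.
gt = [(0.0, 10.0), (20.0, 30.0)]
props = [(0.0, 9.0), (40.0, 50.0)]
print(average_recall(gt, props, thresholds=[0.5, 0.95]))
```

In this toy case only the first activity is covered, with a tIoU of 0.9, so it counts as a hit at the 0.5 threshold but not at 0.95. An AR-AN curve would repeat this computation while varying how many proposals per video are kept.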

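Experimental Results

Research Questions

RQ1: How can temporal action proposals be generated efficiently while remaining discriminative for the target activities?
RQ2: How effective are current methods at localization and recognition in long, untrimmed videos?
RQ3: How well do models perform at detecting, localizing, and describing multiple events within a single video (dense captioning)?
RQ4: What insights do the large-scale guest datasets (Kinetics-600, AVA, Moments in Time) provide for broad activity understanding?
RQ5: Which approaches achieve top performance across the different tasks and datasets in large-scale activity recognition?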

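Key Findings

Task 1 (Temporal Action Proposals): the top three AUC scores are 71.00, 69.30, and 67.78, from Baidu Vis, Shanghai Jiao Tong University, and YH Technologies, respectively.
Task 2 (Temporal Action Localization): the top three average mAP scores are 38.53, 35.49, and 35.27.
Task 3 (Dense-Captioning Events): the top two average METEOR scores are 8.53 and 8.13.
Task A (Trimmed Activity Recognition): the top three average errors are 10.99, 11.69, and 12.20.
Task B (Spatio-temporal Action Localization): mAP@0.5IoU is 21.08, 21.03, and 19.60 on the CV track, and 20.99, 19.60, and 16.76 on the Full track.
Task C (Trimmed Event Recognition): the top three average accuracies are 52.91, 51.26, and 50.06 on the Full track, and 47.72, 45.49, and 45.10 on the Mini track.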

This review was generated by AI and checked by a human editor.