QUICK REVIEW

[Paper Review] Activity Graph Transformer for Temporal Action Localization

Megha Nawhal, Greg Mori | arXiv (Cornell University) | Jan 21, 2021

Human Pose and Action Recognition | References: 61 | Citations: 42

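One-Line Summary

The paper introduces Activity Graph Transformer (AGT), which treats an untrimmed video as a graph and directly predicts a set of action instances, each with a label and start/end times. It reports state-of-the-art results on THUMOS14, Charades, and EPIC-Kitchens-100.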

ABSTRACT

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization, that receives a video as input and directly predicts a set of action instances that appear in the video. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs. We evaluate our model on challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.

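Motivation and Objectives

Motivates the need for non-linear temporal reasoning when actions in untrimmed videos overlap, recur, or occur non-sequentially.
Proposes an end-to-end, graph-based encoder-decoder transformer that directly predicts each action's label, start time, and end time.
Uses a Hungarian matcher to align predictions with the ground truth, eliminating heuristic post-processing.
Demonstrates state-of-the-art performance on the THUMOS14, Charades, and EPIC-Kitchens-100 datasets.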

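Proposed Method

Encode the video as a context graph, splitting it into 8-frame segments represented by I3D backbone features.
Use an encoder with graph attention to produce a latent graph context representation.
Use a graph-structured query decoder to produce a set of action embeddings representing candidate actions.
Predict the action label and normalized start/end times from each decoder node via dedicated heads.
Train end to end with a Hungarian matching loss that combines classification probability with temporal proximity (L1 and IoU) losses.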
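The matching step in the last item can be illustrated with a small sketch. This is not the authors' implementation: the cost weights (`w_cls`, `w_l1`, `w_iou`), the dictionary layout of predictions and ground truth, and the function names are all assumptions made for illustration. For brevity it also swaps the Hungarian algorithm for a brute-force search over assignments, which finds the same minimum-cost matching but is only practical for a handful of instances.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred, gt, w_cls=1.0, w_l1=1.0, w_iou=1.0):
    """Cost of assigning one prediction to one ground-truth instance.

    The weights are hypothetical; the review does not state how the
    classification, L1, and IoU terms are balanced.
    """
    cls_term = -pred["probs"].get(gt["label"], 0.0)  # high probability -> low cost
    l1_term = (abs(pred["span"][0] - gt["span"][0])
               + abs(pred["span"][1] - gt["span"][1]))
    iou_term = 1.0 - temporal_iou(pred["span"], gt["span"])
    return w_cls * cls_term + w_l1 * l1_term + w_iou * iou_term


def match(preds, gts):
    """Return ([(pred_idx, gt_idx), ...], total_cost) with minimum total cost.

    Brute force stand-in for the Hungarian algorithm: tries every way of
    assigning a distinct prediction to each ground-truth instance.
    """
    assert len(preds) >= len(gts), "need at least one prediction per instance"
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(p, g) for g, p in enumerate(best)], best_cost


# Toy example with normalized start/end times in [0, 1].
preds = [
    {"probs": {"run": 0.10, "jump": 0.80}, "span": (0.50, 0.70)},
    {"probs": {"run": 0.90, "jump": 0.05}, "span": (0.10, 0.40)},
    {"probs": {"run": 0.20, "jump": 0.20}, "span": (0.80, 0.95)},
]
gts = [
    {"label": "run", "span": (0.10, 0.45)},
    {"label": "jump", "span": (0.50, 0.75)},
]
pairs, total = match(preds, gts)  # pairs == [(1, 0), (0, 1)]
```

In this toy example prediction 1 is matched to the "run" instance and prediction 0 to the "jump" instance, while prediction 2 is left unmatched. Because every ground-truth instance receives exactly one prediction, duplicates are penalized during training rather than filtered afterwards, which is how a matching loss can stand in for heuristic post-processing.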

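Experimental Results

Research Questions

RQ1: Can temporal action localization be effectively formulated as a direct set prediction task over graphs?
RQ2: Does reasoning over a non-sequential graph representation improve localization of overlapping, recurring, and non-sequential actions?
RQ3: How does AGT compare with state-of-the-art methods on THUMOS14, Charades, and EPIC-Kitchens-100?
RQ4: Is end-to-end training with Hungarian matching sufficient to replace heuristics such as non-maximum suppression in action localization?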

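Key Findings

AGT achieves state-of-the-art mAP on the THUMOS14, Charades, and EPIC-Kitchens-100 datasets.
On THUMOS14, AGT improves on the previous best method by up to 3.5% absolute across the evaluated IoU thresholds.
On Charades, AGT leads by a clear margin, achieving higher mAP than prior methods.
On EPIC-Kitchens-100, AGT shows consistent gains across the verb, noun, and action tasks.
Ablation studies show that removing graph-based reasoning from either the encoder or the decoder degrades localization performance, underscoring the importance of graph reasoning.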


This review was generated by AI and checked by a human editor.