QUICK REVIEW

[Paper Review] TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

Mina Bishay, Georgios Zoumpourlis|arXiv (Cornell University)|Jul 21, 2019

Human Pose and Action Recognition|Cited by 84

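One-line summary

TARN introduces a temporal attentive relation network for few-shot and zero-shot action recognition. It uses segment-level attention to align video segments and learns a deep distance measure for matching videos. Without fine-tuning or an additional memory module, it achieves state-of-the-art results in few-shot learning (FSL) and competitive results in zero-shot learning (ZSL).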

ABSTRACT

In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different length (in the case of few-shot action recognition) or a video and a semantic representation such as word vector (in the case of zero-shot action recognition). By contrast to other works in few-shot and zero-shot action recognition, we a) utilise attention mechanisms so as to perform temporal alignment, and b) learn a deep-distance measure on the aligned representations at video segment level. We adopt an episode-based training scheme and train our network in an end-to-end manner. The proposed method does not require any fine-tuning in the target domain or maintaining additional representations as is the case of memory networks. Experimental results show that the proposed architecture outperforms the state of the art in few-shot action recognition, and achieves competitive results in zero-shot action recognition.
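The episode-based training scheme mentioned in the abstract can be illustrated with a small sampler. This is a minimal sketch under assumptions, not the authors' code: the function name `sample_episode`, its parameters, and the dictionary layout of the dataset are all hypothetical, and the review does not state the episode sizes used in the paper.

```python
import random

def sample_episode(videos_by_class, n_way, k_shot, n_query, rng=None):
    """Draw one N-way K-shot episode (hypothetical helper, not from the paper).

    videos_by_class: dict mapping a class label to a list of video ids.
    Returns (support, queries), where support maps each chosen class to
    k_shot video ids and queries is a list of (video id, label) pairs.
    """
    rng = rng or random.Random()
    # Pick the classes for this episode.
    classes = rng.sample(sorted(videos_by_class), n_way)
    support, queries = {}, []
    for label in classes:
        # Sample without replacement so support and query videos never overlap.
        picked = rng.sample(videos_by_class[label], k_shot + n_query)
        support[label] = picked[:k_shot]
        queries.extend((video, label) for video in picked[k_shot:])
    return support, queries
```

Each training step would then score every query video against the support set of its episode, so the network is trained on the same kind of comparison task it faces at test time.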

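Motivation and objectives

Address few-shot action recognition by comparing video segments rather than whole videos.
Extend to zero-shot action recognition by relating video segments to semantic class representations.
Develop an end-to-end trainable architecture that requires neither memory networks nor fine-tuning in the target domain.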

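Proposed method

The embedding module takes C3D features of the video segments and produces segment embeddings with a bidirectional GRU.
The relation module applies segment-by-segment attention to align the sample and query segments, transforming the representations to an equal number of segments.
The segment-wise comparisons are fed into a deep distance-learning network, which outputs a relation score for each video pair.
A softmax over the relation scores gives class probabilities. In the K-shot case, scores are averaged per class.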
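The alignment and scoring steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's implementation: the attention here is plain dot-product attention with no learned weights, the sample video is aligned to the query's length (the direction is an assumption), a mean cosine similarity stands in for the learned deep distance network, and K-shot scores are averaged before the softmax. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def align_to_query(sample, query):
    # sample: (S, d) segment embeddings, query: (Q, d) segment embeddings.
    # Assumption: dot-product attention without learned parameters.
    weights = softmax(query @ sample.T, axis=1)  # (Q, S), each row sums to 1
    return weights @ sample                      # (Q, d), same length as the query

def relation_score(sample, query, eps=1e-8):
    # Stand-in for the learned deep distance network (assumption):
    # mean cosine similarity between aligned sample and query segments.
    aligned = align_to_query(sample, query)
    dots = (aligned * query).sum(axis=1)
    norms = np.linalg.norm(aligned, axis=1) * np.linalg.norm(query, axis=1)
    return float(np.mean(dots / (norms + eps)))

def class_probabilities(support, query):
    # support: dict mapping a class label to a list of K videos, each (S_i, d).
    labels = sorted(support)
    # Average the relation scores over the K shots of each class.
    scores = np.array([
        np.mean([relation_score(s, query) for s in support[label]])
        for label in labels
    ])
    return labels, softmax(scores)
```

Because the attention output always has one row per query segment, videos with different numbers of segments can be compared segment by segment without padding or truncation.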

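Experimental results

Research questions

RQ1: Can segment-level attention improve temporal alignment and matching in few-shot action recognition?
RQ2: Does segment-wise comparison with a learned deep distance measure outperform whole-video comparison and fixed-distance approaches in FSL?
RQ3: Can the framework be extended to zero-shot action recognition, where semantic vectors serve as the targets to be aligned with video segments?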

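Key findings

TARN, with segment-by-segment attention and deep distance learning, outperforms the state of the art in few-shot action recognition in the 1-shot through 5-shot settings.
Using EucCos as the similarity measure in the comparison layer gives the best results among the options tested.
Attention-based multi-segment comparison outperforms the single-vector baseline (TARN-single) across datasets and feature types.
In the zero-shot setting, TARN achieves competitive results, particularly on the UCF-101 splits, and multi-segment-to-attribute comparison gives the best performance.
Within this framework, C3D-based features generally outperform ResNet-50 features for few-shot action recognition.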
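The review does not define EucCos. The sketch below assumes, from the name, that it pairs the Euclidean distance with the cosine similarity of each aligned segment pair and passes both values on as comparison features. The function name `euc_cos` and the output layout are hypothetical.

```python
import numpy as np

def euc_cos(aligned, query, eps=1e-8):
    """Per-segment comparison features (assumed reading of EucCos).

    aligned, query: arrays of shape (T, d) with matching segment counts.
    Returns an array of shape (T, 2): column 0 is the Euclidean distance,
    column 1 is the cosine similarity of each segment pair.
    """
    euclidean = np.linalg.norm(aligned - query, axis=1)
    dots = (aligned * query).sum(axis=1)
    norms = np.linalg.norm(aligned, axis=1) * np.linalg.norm(query, axis=1)
    cosine = dots / (norms + eps)
    return np.stack([euclidean, cosine], axis=1)
```

Under this reading, the comparison layer keeps both the magnitude of the difference and the angle between the two embeddings, and the distance-learning network decides how to weigh them.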

This review was generated by AI and checked by a human editor.