QUICK REVIEW

[Paper Review] Video-based Human Action Recognition using Deep Learning: A Review

Hieu H. Pham, Louahdi Khoudour|arXiv (Cornell University)|Aug 7, 2022

Human Pose and Action Recognition | References: 231 | Citations: 24

One-Line Summary

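A comprehensive survey of deep learning techniques for video-based human action recognition, outlining the architectures (CNNs, RNN-LSTMs, DBNs, SDAs), datasets, and quantitative benchmarks, together with the current challenges.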

ABSTRACT

Human action recognition is an important application domain in computer vision. Its primary aim is to accurately describe human actions and their interactions from a previously unseen data sequence acquired by sensors. The ability to recognize, understand, and predict complex human actions enables the construction of many important applications such as intelligent surveillance systems, human-computer interfaces, health care, security, and military applications. In recent years, deep learning has been given particular attention by the computer vision community. This paper presents an overview of the current state-of-the-art in action recognition using video analysis with deep learning techniques. We present the most important deep learning models for recognizing human actions, and analyze them to provide the current progress of deep learning algorithms applied to solve human action recognition problems in realistic videos highlighting their advantages and disadvantages. Based on the quantitative analysis using recognition accuracies reported in the literature, our study identifies state-of-the-art deep architectures in action recognition and then provides current trends and open problems for future works in this field.

Motivation and Objectives

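Evaluate state-of-the-art deep learning models for video-based action recognition.
Analyze the advantages and limitations of CNNs, RNN-LSTMs, DBNs, and SDAs in realistic video settings.
Summarize the benchmark datasets and explain how they have shaped progress in deep action recognition.
Identify open problems and future research directions in deep learning-based action recognition.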

Proposed Method

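Reviews the main deep learning architectures used for action recognition (CNNs, RNN-LSTMs, DBNs, SDAs).
Explains the core idea and mathematical foundation of each architecture (convolution, pooling, LSTM gates, RBMs, autoencoders).
Provides qualitative and quantitative comparisons of deep learning methods on standard datasets.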
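
As a rough illustration of two of the building blocks named above, convolution and pooling, the following is a minimal NumPy sketch. It is not code from the paper: the function names `conv2d_valid` and `max_pool2d`, the toy 6x6 frame, and the difference kernel are all assumptions chosen for illustration, and real CNN layers add learned kernels, multiple channels, biases, and nonlinearities.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a 2D kernel over a 2D image without padding.

    This is cross-correlation (no kernel flip), which is the operation
    CNN layers conventionally call "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every location (weight sharing),
            # and each output depends only on a small patch (local connectivity).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = (feature_map.shape[0] // size) * size
    w = (feature_map.shape[1] // size) * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Hypothetical toy input: a 6x6 "frame" whose values increase left to right.
frame = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0]])         # horizontal difference filter
features = conv2d_valid(frame, edge_kernel)   # shape (6, 5)
pooled = max_pool2d(features, size=2)         # shape (3, 2)
print(features.shape, pooled.shape)
```

Pooling here halves the spatial resolution, which is how stacked convolution and pooling layers trade spatial detail for a more compact, translation-tolerant representation.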

Experimental Results

Research Questions

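RQ1: What are the main deep learning architectures applied to video-based action recognition?
RQ2: How do these architectures perform on widely used action recognition benchmarks?
RQ3: What are the current challenges and open problems in applying them to realistic video action recognition?
RQ4: How do large-scale datasets and RGB-D/skeleton data affect model development and evaluation?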

Key Findings

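CNNs introduced feature learning directly from raw video frames through local connectivity, weight sharing, and pooling, enabling end-to-end representation learning for action recognition.
RNN-LSTMs (including bidirectional LSTMs) model the temporal dynamics and context of video sequences to classify actions.
DBNs and SDAs provide deep hierarchical representations with layer-wise pretraining: DBNs use stacked RBMs, while SDAs use denoising autoencoders for unsupervised pretraining.
Reported state-of-the-art results on HMDB-51 include 62.0% with RGB + optical flow fusion (Wang et al., 2016) and 59.4% with a two-stream CNN + SVM (Simonyan et al., 2014).
The shift from lab-controlled datasets (KTH, Weizmann) to large-scale, real-world datasets (Sports-1M, ActivityNet, NTU RGB+D) underscores the move toward realistic action recognition challenges.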
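
To make the temporal-modeling finding more concrete, here is a minimal NumPy sketch of a standard LSTM cell stepping through per-frame feature vectors. It is an illustrative assumption, not an implementation from the reviewed paper: the dimensions `T`, `D`, `H`, the random weights, and the gate ordering inside the stacked matrices are all hypothetical, and a real system would learn the weights and feed the final hidden state to a classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W: (4H, D), U: (4H, H), b: (4H,). The four row blocks are assumed to be
    ordered as input gate, forget gate, output gate, candidate cell state.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell content
    c_t = f * c_prev + i * g       # updated cell state
    h_t = o * np.tanh(c_t)         # updated hidden state
    return h_t, c_t

# Hypothetical sizes: 8 frames, 16-dim frame features, 4 hidden units.
rng = np.random.default_rng(0)
T, D, H = 8, 16, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

# Stand-in for per-frame features (e.g. from a CNN); random values here.
frame_features = rng.normal(size=(T, D))

h = np.zeros(H)
c = np.zeros(H)
for x_t in frame_features:
    h, c = lstm_step(x_t, h, c, W, U, b)

# h now summarizes the whole sequence and could be passed to a classifier.
print(h.shape)
```

A bidirectional variant would run a second cell over the frames in reverse order and concatenate the two final hidden states.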


This review was generated by AI and checked by a human editor.