QUICK REVIEW

[Paper Review] Video-based Human Action Recognition using Deep Learning: A Review

Hieu H. Pham, Louahdi Khoudour|arXiv (Cornell University)|Aug 7, 2022

Human Pose and Action Recognition|References: 231|Cited by: 24

One-sentence summary

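A comprehensive review of deep learning techniques for video-based human action recognition, covering architectures (CNNs, RNN-LSTMs, DBNs, SDAs), datasets, and current challenges, supported by quantitative benchmarks.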

ABSTRACT

Human action recognition is an important application domain in computer vision. Its primary aim is to accurately describe human actions and their interactions from a previously unseen data sequence acquired by sensors. The ability to recognize, understand, and predict complex human actions enables the construction of many important applications such as intelligent surveillance systems, human-computer interfaces, health care, security, and military applications. In recent years, deep learning has been given particular attention by the computer vision community. This paper presents an overview of the current state-of-the-art in action recognition using video analysis with deep learning techniques. We present the most important deep learning models for recognizing human actions, and analyze them to provide the current progress of deep learning algorithms applied to solve human action recognition problems in realistic videos highlighting their advantages and disadvantages. Based on the quantitative analysis using recognition accuracies reported in the literature, our study identifies state-of-the-art deep architectures in action recognition and then provides current trends and open problems for future works in this field.

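Motivation and objectives

Evaluate state-of-the-art deep learning models for video action recognition.
Analyze the strengths and limitations of CNNs, RNN-LSTMs, DBNs, and SDAs in realistic video settings.
Summarize the benchmark datasets and their influence on progress in deep action recognition.
Identify open problems and potential directions for future research in deep learning-based action recognition.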

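Proposed approach

Review the key deep learning architectures used for action recognition (CNNs, RNN-LSTMs, DBNs, SDAs).
Explain the core idea and mathematical basis of each architecture (convolution, pooling, LSTM gating, RBMs, autoencoders).
Compare deep learning methods on standard datasets, both qualitatively and quantitatively.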
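Of the building blocks listed above, convolution and pooling are the easiest to show in a few lines. The following is a minimal NumPy sketch, not code from the reviewed paper: the function names `conv2d_valid` and `max_pool2d`, the toy 5x5 "frame", and the 2x2 filter are all hypothetical choices made for illustration. It assumes a single-channel input, stride 1, and no padding.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlate a 2-D image with one shared kernel (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output value sees only a kh x kw patch.
            # Weight sharing: the same kernel is reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy input (assumption for illustration): a 5x5 intensity ramp.
frame = np.arange(25, dtype=float).reshape(5, 5)
# Hypothetical 2x2 filter that responds to diagonal intensity differences.
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

features = conv2d_valid(frame, kernel)  # shape (4, 4)
pooled = max_pool2d(features)           # shape (2, 2)
```

In a real CNN the kernel values are learned rather than fixed, many kernels run in parallel, and video models typically extend the same idea along the time axis.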

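Experimental results

Research questions

RQ1: What are the main deep learning architectures applied to video-based action recognition?
RQ2: How do these architectures perform on widely used action recognition benchmarks?
RQ3: What challenges and open problems remain in applying deep learning to real-world video action recognition?
RQ4: How do large-scale datasets and RGB-D/skeleton data affect model development and evaluation?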

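Key findings

CNNs learn features directly from raw video frames through local connectivity, weight sharing, and pooling, which enables end-to-end representation learning for action recognition.
RNN-LSTMs (including bidirectional LSTMs) model the temporal dynamics and context of video sequences for action classification.
DBNs and SDAs learn deep hierarchical feature representations with layer-wise pretraining: DBNs stack RBMs, while SDAs use denoising autoencoders for unsupervised pretraining.
State-of-the-art results reported for HMDB-51 include 62.0% with RGB + optical flow fusion (Wang et al., 2016) and 59.4% with a two-stream CNN + SVM (Simonyan et al., 2014).
The progression from controlled laboratory datasets (KTH, Weizmann) to large-scale, in-the-wild datasets (Sports-1M, ActivityNet, NTU RGB+D) reflects a shift toward realistic action recognition challenges.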
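The two-stream results listed above combine an appearance (RGB) stream with a motion (optical flow) stream. One simple way to combine them is late fusion by averaging class probabilities, sketched below. This is an illustration under stated assumptions, not the method of any cited paper: it uses plain weighted averaging rather than the SVM fusion mentioned above, and the names `late_fusion` and `flow_weight` and the logit values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of per-class probabilities from the two streams."""
    rgb_prob = softmax(np.asarray(rgb_logits, dtype=float))
    flow_prob = softmax(np.asarray(flow_logits, dtype=float))
    fused = (1.0 - flow_weight) * rgb_prob + flow_weight * flow_prob
    return fused, int(np.argmax(fused))

# Made-up class scores for a three-class toy problem (not real model outputs).
rgb_logits = [2.0, 1.0, 0.1]    # appearance stream favors class 0
flow_logits = [0.5, 3.0, 0.2]   # motion stream strongly favors class 1

fused, predicted = late_fusion(rgb_logits, flow_logits)
```

With these toy scores the more confident motion stream outweighs the appearance stream, so the fused prediction is class 1. In practice `flow_weight` would be tuned on validation data.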


This review was generated by AI and checked by human editors.