QUICK REVIEW

[Paper Review] Review of Action Recognition and Detection Methods

Soo Min Kang, Richard P. Wildes|arXiv (Cornell University)|Oct 21, 2016

Human Pose and Action Recognition | References: 5 | Cited by: 52

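One-Sentence Summary

This paper surveys third-person action recognition and detection methods in computer vision, analyzing techniques for feature extraction, encoding, and classification. It compares state-of-the-art methods on benchmark datasets and identifies key challenges and open problems in handling real-world variability and improving robustness.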

ABSTRACT

In computer vision, action recognition refers to the act of classifying an action that is present in a given video and action detection involves locating actions of interest in space and/or time. Videos, which contain photometric information (e.g. RGB, intensity values) in a lattice structure, contain information that can assist in identifying the action that has been imaged. The process of action recognition and detection often begins with extracting useful features and encoding them to ensure that the features are specific to serve the task of action recognition and detection. Encoded features are then processed through a classifier to identify the action class and their spatial and/or temporal locations. In this report, a thorough review of various action recognition and detection algorithms in computer vision is provided by analyzing the two-step process of a typical action recognition and detection algorithm: (i) extraction and encoding of features, and (ii) classifying features into action classes. In efforts to ensure that computer vision-based algorithms reach the capabilities that humans have of identifying actions irrespective of various nuisance variables that may be present within the field of view, the state-of-the-art methods are reviewed and some remaining problems are addressed in the final chapter.

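Research Motivation and Objectives

Systematically analyze the two-stage pipeline of action recognition and detection: feature extraction and encoding, followed by classification.
Assess the performance and limitations of existing algorithms under different conditions, across diverse benchmark datasets (e.g., static versus dynamic backgrounds, real-world videos).
Identify persistent challenges, such as robustness to nuisance variables (e.g., viewpoint, illumination, occlusion) and the need for better generalization in real-world scenarios.
Highlight emerging trends, including deep learning-based models and first-person action recognition, while keeping the main focus on third-person action recognition.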

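Proposed Methods

The paper reviews feature extraction techniques built on sampling strategies (e.g., uniform or dense sampling) and descriptors (e.g., HOG, HOF, MBH).
It examines encoding methods, including codebook generation (e.g., K-means clustering), feature assignment (e.g., histogram-based methods), and pooling and normalization (e.g., VLAD, Fisher vectors).
It assesses deterministic classifiers (e.g., SVM, k-NN) and probabilistic models (e.g., HMM, CRF), including temporal state-space models for sequence modeling.
It analyzes action proposals as a mechanism for reducing the search space, generating regions of high actionness via superpixel-based segmentation, motion cues, or lattice CRFs.
It discusses anomalous action detection and action prediction as related tasks; these methods are based on modeling normality, and on moving gradually from prediction to recognition as confidence grows.
The survey includes a comparative analysis of datasets such as KTH, UCF101, HMDB51, ActivityNet, and THUMOS, highlighting the differences in their evaluation protocols and the challenges they pose.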
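The encoding stage described above (codebook generation, feature assignment, then normalization) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from the surveyed paper: it uses plain K-means with a simplified, deterministic initialization, hard-assignment histograms rather than VLAD or Fisher vectors, and toy 2-D points standing in for real descriptors such as HOG, HOF, or MBH. The function names `kmeans` and `encode` are hypothetical, and the classifier stage is omitted.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the codeword closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))

def kmeans(descriptors, k, iters=20):
    """Codebook generation with plain K-means.

    Assumption: centers are initialized from evenly spaced descriptors,
    a simplification chosen to keep the sketch deterministic.
    """
    step = len(descriptors) // k
    centers = [list(descriptors[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def encode(descriptors, centers):
    """Hard-assignment histogram over codewords, L2-normalized."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist

# Toy descriptors pooled from training videos (two obvious clusters).
train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
codebook = kmeans(train, k=2)

# Descriptors from one video become a fixed-length vector for a classifier.
video = [[0.1, 0.1], [5.0, 5.0], [5.1, 4.8]]
print(encode(video, codebook))
```

The resulting fixed-length vector is what a classifier such as an SVM or k-NN would consume. Richer encodings like VLAD or Fisher vectors replace the counting step with residual or gradient statistics.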

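Experimental Results

Research Questions

RQ1: How do different feature extraction and encoding strategies affect the performance of action recognition and detection systems?
RQ2: What are the key differences in performance and robustness across benchmark datasets that vary in background dynamics and action complexity?
RQ3: How well do current methods generalize under diverse real-world conditions such as viewpoint changes, occlusion, and cluttered scenes?
RQ4: How do deep learning-based models compare with traditional hand-crafted feature methods in accuracy and efficiency?
RQ5: What open problems remain in achieving human-level robustness in action recognition and detection?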

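Key Findings

Traditional methods based on hand-crafted features (e.g., iDT combined with Fisher vector encoding and an SVM) perform strongly on controlled datasets such as KTH and UCF101.
Deep learning models, particularly two-stream CNNs, significantly outperform traditional methods on large-scale datasets such as ActivityNet and Sports-1M.
Action proposal generation lowers computational cost and improves detection efficiency by focusing on regions of high actionness, without sacrificing accuracy.
Anomalous action detection based on normality modeling shows promise for identifying unexpected behavior, especially in surveillance scenarios.
Action prediction models show confidence that rises gradually as an action unfolds, enabling early intervention in safety-critical applications.
Despite this progress, challenges remain in handling dynamic backgrounds, long-range temporal dependencies, and domain shift across datasets.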
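The finding that prediction confidence rises as an action unfolds can be illustrated with a simple early-decision rule. This is a sketch under stated assumptions, not a method taken from the survey: per-frame class scores are assumed to be given, they are summed over time and turned into a confidence with a softmax, and the first frame whose top confidence reaches a threshold triggers a decision. The function `predict_early`, the threshold value, and the toy scores are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_early(frame_scores, threshold=0.9):
    """Return (frame_index, class_index, confidence).

    Scores are accumulated frame by frame; the function stops at the first
    frame whose top-class confidence reaches the threshold. If the threshold
    is never reached, it falls back to the decision at the last frame.
    Assumes frame_scores is non-empty.
    """
    running = [0.0] * len(frame_scores[0])
    for t, scores in enumerate(frame_scores):
        running = [r + s for r, s in zip(running, scores)]
        probs = softmax(running)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return t, best, probs[best]
    return t, best, probs[best]

# Toy per-frame scores for two classes; evidence for class 0 grows over time.
frame_scores = [[0.2, 0.1], [0.5, 0.1], [1.0, 0.0], [1.5, 0.1], [1.2, 0.0]]
frame, label, confidence = predict_early(frame_scores, threshold=0.9)
print(frame, label, round(confidence, 3))
```

A lower threshold commits earlier at the cost of more wrong early decisions. A higher one delays the decision until it effectively becomes recognition on the full clip.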


This review was generated by AI and checked by human editors.