QUICK REVIEW

[Paper Review] A Comprehensive Study of Deep Video Action Recognition

Yi Zhu, Xinyu Li|arXiv (Cornell University)|Dec 11, 2020

Human Pose and Action Recognition | References: 274 | Cited by: 115

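One-Sentence Summary

This paper surveys more than 200 papers on deep learning for video action recognition, discusses datasets and challenges, benchmarks popular models, and releases code for reproducibility.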

ABSTRACT

Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.

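Motivation and Objectives

Survey the research landscape of video action recognition across 200+ papers.
Catalog the datasets and their influence on model design and evaluation.
Analyze how models evolved from two-stream networks to 3D convolutional neural networks and compute-efficient architectures.
Benchmark representative methods on standard datasets to compare accuracy and efficiency.
Identify open problems and opportunities to guide future research and development.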

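Proposed Approach

Review the core developments in video action recognition in chronological order (from hand-crafted features to CNNs, two-stream networks, 3D CNNs, and compute-efficient models).
Systematically discuss the datasets and challenges that shaped model design and evaluation.
Empirically benchmark popular methods on standard benchmarks to assess accuracy and efficiency.
Release model implementations for PyTorch and MXNet to ensure reproducibility.
Analyze open problems and opportunities for future research in video action recognition.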

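Experimental Results

Research Questions

RQ1: Which datasets and evaluation protocols have had the greatest influence on the design of video action recognition models?
RQ2: How have model architectures evolved to address temporal modeling and computational efficiency?
RQ3: What are the trade-offs between two-stream methods and 3D CNN methods, and how do compute-efficient methods compare?
RQ4: Which open problems and opportunities remain for advancing video action recognition?
RQ5: How do multi-stream and multi-modal methods (pose, objects, audio) improve recognition performance?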

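Key Findings

More than 200 papers and 17 influential datasets have shaped the field and its evaluation practices.
Two-stream networks established the importance of fusing appearance and motion information (RGB frames and optical flow).
3D CNNs (e.g., I3D) substantially improved performance by modeling spatio-temporal features directly, especially after pre-training on large datasets such as Kinetics400.
Segment-based and compute-efficient models (e.g., TSN, TSM, X3D) enable long-range temporal modeling and scale to larger datasets.
Results on the core benchmarks show clear gains when moving from shallow to deep architectures and from 2D to 3D representations, with pre-trained I3D reaching high accuracy on UCF101 and HMDB51.
The authors release code to support reproducibility and provide researchers with a model zoo.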
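
The two-stream and segment-based findings above can be illustrated with a small sketch. It is not the authors' released code: the function names, the choice of the segment-centre frame at test time, and the equal fusion weight are all assumptions made for illustration. It only shows the general idea of sampling one frame per segment, averaging class scores across segments, and fusing RGB and optical-flow scores by a weighted average.

```python
import math
import random


def sample_segment_indices(num_frames, num_segments, train=False, rng=None):
    """Pick one frame index per equal-length segment (sparse, TSN-style sampling)."""
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = k * seg_len
        # Assumption: random offset when training, segment centre otherwise.
        offset = rng.random() * seg_len if train else seg_len / 2
        indices.append(min(int(start + offset), num_frames - 1))
    return indices


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_consensus(per_segment_scores):
    """Average class scores over all sampled segments."""
    n = len(per_segment_scores)
    return [sum(col) / n for col in zip(*per_segment_scores)]


def late_fusion(rgb_logits, flow_logits, flow_weight=1.0):
    """Weighted average of the two streams' softmax probabilities.

    flow_weight is a hypothetical knob; 1.0 means both streams count equally.
    """
    rgb = softmax(rgb_logits)
    flow = softmax(flow_logits)
    total = 1.0 + flow_weight
    return [(r + flow_weight * f) / total for r, f in zip(rgb, flow)]


# Toy example: a 30-frame clip, 3 segments, 3 classes, made-up logits.
indices = sample_segment_indices(num_frames=30, num_segments=3)  # [5, 15, 25]
rgb_video = segment_consensus([[2.0, 1.0, 0.0], [1.5, 1.0, 0.5]])
flow_video = segment_consensus([[0.0, 3.0, 0.0], [0.5, 2.5, 0.0]])
fused = late_fusion(rgb_video, flow_video)
predicted_class = fused.index(max(fused))
```

In this toy case the RGB stream mildly prefers class 0 while the flow stream strongly prefers class 1, so the fused prediction follows the more confident motion stream. In a real system the per-segment logits would come from trained networks rather than hand-written lists.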

This review was generated by AI and checked by a human editor.