QUICK REVIEW

[Paper Review] Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Lin Sun, Kui Jia|arXiv (Cornell University)|Oct 2, 2015

Human Pose and Action Recognition|References: 30|Cited by: 112

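ONE-SENTENCE SUMMARY

The paper proposes the factorized spatio-temporal convolutional network (FstCN), a deep architecture that factorizes 3D convolution into sequential 2D spatial and 1D temporal convolutions to reduce model complexity and make training more tractable. Without auxiliary training data, FstCN outperforms existing CNN-based methods on UCF-101 and HMDB-51, exceeding the two-stream CNN by about 1% under averaging-based fusion, and is comparable to a recent method that uses additional training videos.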

ABSTRACT

Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.

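RESEARCH MOTIVATION AND OBJECTIVES

Address the high computational complexity and large data requirements of 3D convolutional neural networks for human action recognition.
Improve spatio-temporal feature learning by factorizing 3D convolution into a spatial stage and a temporal stage.
Overcome the challenges posed by sequence alignment and intra-class variation in human actions.
Develop an accurate deep architecture that does not rely on auxiliary training videos.
Enable effective end-to-end learning of spatio-temporal features through a novel factorization and permutation mechanism.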
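
The complexity argument can be made concrete with a separable kernel. The following is a sketch under the assumption that the 3D kernel is exactly rank-1; the symbols ($K_{xy}$, $k_t$, $n_x$, $n_y$, $n_t$) are chosen here for illustration and are not quoted from the paper.

```latex
\[
K(i_x, i_y, i_t) = K_{xy}(i_x, i_y)\, k_t(i_t)
\quad\Longrightarrow\quad
V * K = \bigl(V *_{xy} K_{xy}\bigr) *_{t} k_t ,
\]
\[
\underbrace{n_x\, n_y\, n_t}_{\text{weights in one 3D kernel}}
\;\longrightarrow\;
\underbrace{n_x\, n_y + n_t}_{\text{weights in the 2D + 1D pair}} .
\]
```

For a 3 x 3 x 3 kernel this is 27 weights against 12, which is the kind of reduction the factorization is meant to exploit.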

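PROPOSED METHOD

FstCN uses a cascaded architecture with two stages: spatial convolutional layers (SCL) that learn 2D spatial features, followed by temporal convolutional layers (TCL) that learn 1D temporal features.
A novel transformation and permutation operator (the T-P operator) makes it possible to factorize 3D convolution kernels into separable 2D and 1D components within a deep learning framework.
The network uses a sampling-based training and inference strategy that extracts multiple clips from each video to cope with varying action speeds and improve robustness.
Features from the SCL and TCL are concatenated before the final classifier layer, fusing spatial appearance with motion dynamics.
Saliency maps visualized via back-propagation verify that the learned filters attend to semantically relevant regions (for example, the mouth in facial actions).
t-SNE visualization of the feature embeddings shows that the spatio-temporal features learned by FstCN are more discriminative than spatial-only or temporal-only features.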
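
The spatial-then-temporal cascade described above can be illustrated with a minimal single-channel sketch. It only checks the separable-kernel identity that motivates the design. It does not reproduce the paper's T-P operator, multi-channel layers, or training. All function names are hypothetical, and the operation is cross-correlation, as in most deep learning frameworks.

```python
def correlate3d(volume, kernel):
    """Naive 'valid' 3D cross-correlation on nested lists indexed [t][y][x]."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    nt, ny, nx = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                  for dt in range(nt) for dy in range(ny) for dx in range(nx))
              for x in range(W - nx + 1)]
             for y in range(H - ny + 1)]
            for t in range(T - nt + 1)]


def spatial_then_temporal(volume, k_xy, k_t):
    """Apply a 2D spatial kernel per frame, then a 1D temporal kernel per pixel."""
    spatial = correlate3d(volume, [k_xy])                 # kernel shape [1][ny][nx]
    return correlate3d(spatial, [[[w]] for w in k_t])     # kernel shape [nt][1][1]


def outer_kernel(k_xy, k_t):
    """Build the equivalent rank-1 3D kernel K[t][y][x] = k_t[t] * k_xy[y][x]."""
    return [[[w * v for v in row] for row in k_xy] for w in k_t]


# Toy 4-frame, 5x5 "video" and illustrative (not learned) kernels.
video = [[[float((t * 7 + y * 3 + x) % 5) for x in range(5)]
          for y in range(5)] for t in range(4)]
k_xy = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
k_t = [0.5, -0.5]

full = correlate3d(video, outer_kernel(k_xy, k_t))
factorized = spatial_then_temporal(video, k_xy, k_t)

# Largest absolute difference between the two outputs.
max_diff = max(abs(a - b)
               for fa, fb in zip(full, factorized)
               for ra, rb in zip(fa, fb)
               for a, b in zip(ra, rb))

full_params = len(k_t) * len(k_xy) * len(k_xy[0])        # weights in the 3D kernel
factorized_params = len(k_xy) * len(k_xy[0]) + len(k_t)  # weights in the 2D + 1D pair
```

Both paths produce the same output volume while the factorized pair stores fewer weights. A learned network is not constrained to exactly rank-1 kernels, so this is an intuition aid rather than a description of the trained model.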

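EXPERIMENTAL RESULTS

Research Questions

RQ1: Does factorizing 3D convolution into 2D spatial and 1D temporal convolutions reduce model complexity while maintaining or improving performance?
RQ2: Does the proposed T-P operator enable an effective and stable factorization of 3D convolution kernels within a deep learning framework?
RQ3: Can the FstCN architecture achieve high accuracy on standard benchmarks without auxiliary training videos?
RQ4: How do the combined spatial and temporal features compare with spatial-only or temporal-only features in discriminative power?
RQ5: To what extent does the clip sampling strategy improve robustness to varying action speeds and sequence alignment issues?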
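
To make the clip sampling idea behind RQ5 concrete, here is a minimal sketch of drawing several fixed-length clips from one video. The evenly spaced start positions, the parameter names, and the function `sample_clips` are assumptions made for illustration. The paper's exact sampling scheme may differ.

```python
def sample_clips(num_frames, clip_len, stride, num_clips):
    """Return frame-index lists for evenly spaced clips.

    Each clip takes `clip_len` frames, one every `stride` frames.
    (Hypothetical helper; not the paper's exact procedure.)
    """
    span = (clip_len - 1) * stride + 1      # frames covered by one clip
    if span > num_frames:
        raise ValueError("video too short for this clip_len/stride")
    last_start = num_frames - span          # latest valid starting frame
    if num_clips == 1:
        starts = [last_start // 2]          # single clip: take the middle
    else:
        starts = [(i * last_start) // (num_clips - 1) for i in range(num_clips)]
    return [[s + j * stride for j in range(clip_len)] for s in starts]


# Three 5-frame clips with temporal stride 5 from a 60-frame video.
clips = sample_clips(num_frames=60, clip_len=5, stride=5, num_clips=3)
```

Varying `stride` across calls is one simple way to expose a model to the same action at different apparent speeds.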

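Key Findings

With averaging-based score fusion and no auxiliary data, FstCN reaches an average accuracy of 87.9% on UCF-101 and 58.6% on HMDB-51, about 1% higher than the two-stream CNN.
With SVM-based score fusion, FstCN reaches 88.1% on UCF-101 and 59.1% on HMDB-51, comparable to methods that use additional training videos.
t-SNE visualization shows that the spatio-temporal features learned by FstCN are more discriminative than spatial-only or temporal-only features, especially for subtle actions such as "smile" and "chew".
Saliency maps confirm that the model focuses on the regions that matter for the action (for example, the mouth in facial actions), indicating that it learns where to attend.
The factorized design substantially reduces kernel complexity, allows effective training even with limited video data, and generalizes well on challenging benchmarks.
Ablation experiments show that combining SCL and TCL is essential: the two complement each other and jointly improve overall performance.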
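
The averaging-based fusion mentioned in the findings can be sketched as follows, assuming each sampled clip yields a vector of per-class scores. The function name and the toy scores are hypothetical, and the SVM-based fusion variant is not shown.

```python
def fuse_by_averaging(clip_scores):
    """Average per-class scores over clips and return (predicted_label, averages)."""
    num_classes = len(clip_scores[0])
    averages = [sum(scores[c] for scores in clip_scores) / len(clip_scores)
                for c in range(num_classes)]
    label = max(range(num_classes), key=lambda c: averages[c])
    return label, averages


# Toy scores for three clips over three classes (illustrative values only).
clip_scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],   # this clip alone would vote for class 1
    [0.6, 0.3, 0.1],
]
label, averages = fuse_by_averaging(clip_scores)
```

Averaging lets clips that individually disagree, like the second one here, be outvoted by the video-level consensus.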

This review was generated by AI and reviewed by human editors.