QUICK REVIEW

[Paper Review] An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Sijie Song, Cuiling Lan|arXiv (Cornell University)|Nov 18, 2016

Human Pose and Action Recognition | References: 31 | Cited by: 481

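One-Sentence Summary

This paper proposes an end-to-end LSTM-based architecture with spatial joint attention and temporal frame attention for skeleton-based action recognition, trained with a regularized loss and a joint training strategy, achieving state-of-the-art results on the SBU and NTU datasets.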

ABSTRACT

Human action recognition is an important task in computer vision. Extracting discriminative spatial and temporal features to model the spatial and temporal evolutions of different actions plays a key role in accomplishing this task. In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data. We build our model on top of the Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which learns to selectively focus on discriminative joints of skeleton within each frame of the inputs and pays different levels of attention to the outputs of different frames. Furthermore, to ensure effective training of the network, we propose a regularized cross-entropy loss to drive the model learning process and develop a joint training strategy accordingly. Experimental results demonstrate the effectiveness of the proposed model, both on the small human action recognition data set of SBU and the currently largest NTU dataset.

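Motivation and Objectives

Advance robust action recognition from skeleton data by modeling the relevance of spatial joints and the importance of temporal frames.
Develop an end-to-end architecture that learns to attend to discriminative joints within each frame and to important frames over time.
Introduce regularization terms in the loss and a joint training strategy to stabilize learning of the coupled attention networks.
Demonstrate effectiveness on public skeleton datasets, including SBU Kinect Interaction and NTU RGB+D.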

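Proposed Method

Propose an LSTM-based network with a spatial attention module that computes joint-selection gates within each frame to weight the joints.
Implement a temporal attention module that assigns frame-selection gates to weight each frame's contribution to the final sequence classification.
Formulate a regularized cross-entropy loss with spatial and temporal attention regularization terms and a weight-sparsity term.
Adopt a two-stage joint training procedure that pretrains the spatial/temporal attention components before fine-tuning the whole network.
Use three LSTM layers for the main network and one LSTM layer for each attention subnetwork (100 units per layer).
Evaluate on the SBU Kinect Interaction dataset and on the NTU RGB+D dataset under the cross-subject (CS) and cross-view (CV) settings.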
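The two gating steps listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the LSTM sub-networks that would produce the raw attention scores are omitted, so `joint_scores` and `frame_scores` are hypothetical inputs, and the choice of a softmax over joints and a ReLU over frames is an assumption about the gate activations that should be checked against the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_spatial_attention(frame, joint_scores):
    """Weight each joint of one frame by its joint-selection gate.

    frame: list of K joints, each a list of coordinates.
    joint_scores: K raw scores (hypothetical output of the spatial
    attention sub-network). Assumption: gates are a softmax over joints.
    """
    alphas = softmax(joint_scores)
    gated = [[a * c for c in joint] for a, joint in zip(alphas, frame)]
    return gated, alphas


def aggregate_with_temporal_attention(frame_outputs, frame_scores):
    """Combine per-frame class scores using frame-selection gates.

    frame_outputs: T lists of C class scores from the main network.
    frame_scores: T raw scores (hypothetical output of the temporal
    attention sub-network). Assumption: gates are ReLU(score).
    """
    betas = [max(0.0, s) for s in frame_scores]
    num_classes = len(frame_outputs[0])
    pooled = [
        sum(b * out[c] for b, out in zip(betas, frame_outputs))
        for c in range(num_classes)
    ]
    return softmax(pooled), betas
```

In this sketch a frame whose gate is zero contributes nothing to the pooled class scores, which is the sense in which the temporal module "selects" frames.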

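Experimental Results

Research Questions

RQ1: Can end-to-end spatio-temporal attention improve skeleton-based action recognition over a baseline without attention?
RQ2: Do spatial joint attention and temporal frame attention provide complementary gains when used together?
RQ3: How do the regularization terms and the proposed joint training strategy affect training stability and performance?
RQ4: How does the proposed STA-LSTM compare with prior state-of-the-art methods on the SBU and NTU datasets?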

Key Findings

Method	Accuracy (%)
Raw skeleton ( ?)	49.7
Joint feature ( ?)	80.3
Raw skeleton ( ?)	79.4
Joint feature ( ?)	86.9
Hierarchical RNN ( ?)	80.35
Co-occurrence RNN ( ?)	90.41
STA-LSTM	91.51

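Spatial attention and temporal attention each improve accuracy over the baseline LSTM, by roughly 5.1% and 6.4% respectively (SBU/NTU).
Combining spatial and temporal attention (STA-LSTM) yields the best results on all datasets.
The regularization terms improve the performance of both the spatial and temporal attention modules, and the joint training strategy improves convergence.
STA-LSTM achieves significant accuracy gains over prior methods on NTU (CS and CV) and competitive results on SBU.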
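To make the role of the regularization terms concrete, the following is a rough sketch of a regularized cross-entropy loss with a spatial attention term, a temporal attention term, and a weight-sparsity term. The specific form of each term (a penalty on joints whose average attention is far from 1, a mean-magnitude penalty on frame gates, and an L1 penalty on weights) is an assumption and should be verified against the paper; the function name, argument names, and the `lam1`, `lam2`, `lam3` coefficients are hypothetical and carry no values from the source.

```python
import math


def regularized_loss(probs, label, alphas, betas, weights, lam1, lam2, lam3):
    """Cross-entropy plus three regularizers (assumed forms, see lead-in).

    probs: predicted class probabilities for one sequence.
    label: index of the true class.
    alphas: T x K spatial gates (one list of K joint gates per frame).
    betas: T temporal gates (one scalar per frame).
    weights: flat list of network weights subject to the sparsity penalty.
    lam1, lam2, lam3: hypothetical regularization coefficients.
    """
    num_frames = len(alphas)
    num_joints = len(alphas[0])

    # Standard cross-entropy on the true class.
    cross_entropy = -math.log(probs[label])

    # Spatial term: discourage consistently ignoring a joint over the sequence.
    spatial = 0.0
    for k in range(num_joints):
        mean_alpha = sum(alphas[t][k] for t in range(num_frames)) / num_frames
        spatial += (1.0 - mean_alpha) ** 2

    # Temporal term: keep the frame gates from growing without bound.
    temporal = sum(abs(b) for b in betas) / len(betas)

    # Sparsity term: L1 penalty on the weights.
    sparsity = sum(abs(w) for w in weights)

    return cross_entropy + lam1 * spatial + lam2 * temporal + lam3 * sparsity
```

Setting all three coefficients to zero recovers plain cross-entropy, which gives a simple way to compare training with and without the regularizers.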


This review was generated by AI and reviewed by a human editor.