QUICK REVIEW

[Paper Review] Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning

Chenyang Si, Ya Jing|arXiv (Cornell University)|May 7, 2018

Human Pose and Action Recognition|References: 28|Cited by: 32

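One-Sentence Summary

This paper proposes SR-TSL, a skeleton-based action recognition model that performs spatial reasoning with a residual graph neural network and temporal stack learning with skip-clip LSTMs, capturing both high-level spatial structure and detailed temporal dynamics. The method achieves state-of-the-art performance on the NTU RGB+D and SYSU datasets, raising NTU accuracy to 84.8% on the cross-subject benchmark and 92.4% on the cross-view benchmark, and is supported by ablation experiments and a convergence analysis.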

ABSTRACT

Skeleton-based action recognition has made great progress recently, but many problems still remain unsolved. For example, most of the previous methods model the representations of skeleton sequences without abundant spatial structure information and detailed temporal dynamics features. In this paper, we propose a novel model with spatial reasoning and temporal stack learning (SR-TSL) for skeleton based action recognition, which consists of a spatial reasoning network (SRN) and a temporal stack learning network (TSLN). The SRN can capture the high-level spatial structural information within each frame by a residual graph neural network, while the TSLN can model the detailed temporal dynamics of skeleton sequences by a composition of multiple skip-clip LSTMs. During training, we propose a clip-based incremental loss to optimize the model. We perform extensive experiments on the SYSU 3D Human-Object Interaction dataset and NTU RGB+D dataset and verify the effectiveness of each network of our model. The comparison results illustrate that our approach achieves much better results than state-of-the-art methods.

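Motivation and Objectives

Address the insufficient spatial-structure representation and the limited modeling of detailed temporal dynamics in existing skeleton-based action recognition methods.
Improve long-term sequence modeling by capturing fine-grained temporal dynamics in long skeleton sequences.
Accelerate model convergence and improve recognition accuracy with a new training objective.
Verify the effectiveness of the spatial reasoning and temporal stack learning components, both individually and in combination.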

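Proposed Method

The spatial reasoning network (SRN) uses a residual graph neural network (RGNN) to model the high-level spatial structure among body parts, treating each body part as a node.
The temporal stack learning network (TSLN) uses multiple skip-clip LSTMs that share hidden states, modeling short-term dynamics hierarchically.
The initial hidden state of each clip is set to the sum of the final hidden states of all previous clips, which preserves long-range dependencies.
A clip-based incremental loss is introduced to optimize the stack learning process, improving convergence speed and performance.
A two-stream architecture processes position sequences and velocity sequences together to strengthen the temporal representation.
The model is trained end to end on skeleton sequences from the NTU RGB+D and SYSU 3D Human-Object Interaction datasets.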
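
The two-stream inputs and the clip-wise state initialization described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: velocity is assumed to be the frame-to-frame difference of joint positions (the summary does not define it), a toy tanh recurrence stands in for the LSTM, and the names velocity, split_into_clips, toy_cell, and run_clips are hypothetical.

```python
import math


def velocity(positions):
    """Assumed velocity stream: difference between consecutive frames.

    positions is a list of per-frame feature vectors; the result has one
    fewer frame than the input.
    """
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(positions, positions[1:])]


def split_into_clips(frames, d):
    """Split a frame sequence into consecutive clips of length d.

    The last clip is shorter when len(frames) is not a multiple of d.
    """
    return [frames[i:i + d] for i in range(0, len(frames), d)]


def toy_cell(h, x, w_h=0.5, w_x=0.5):
    """Stand-in recurrent update (NOT an LSTM): h' = tanh(w_h*h + w_x*x)."""
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]


def run_clips(frames, d, hidden_size):
    """Run the toy cell clip by clip.

    Each clip starts from the sum of the final hidden states of all
    previous clips; the first clip starts from zeros.
    """
    running_sum = [0.0] * hidden_size
    initial_states, final_states = [], []
    for clip in split_into_clips(frames, d):
        h = list(running_sum)          # initial state for this clip
        initial_states.append(list(h))
        for x in clip:
            h = toy_cell(h, x)
        final_states.append(h)
        # Accumulate this clip's final state for the clips that follow.
        running_sum = [s + hi for s, hi in zip(running_sum, h)]
    return initial_states, final_states
```

For a six-frame sequence with d = 2, run_clips returns three initial states: a zero vector, the first clip's final state, and the sum of the first two clips' final states.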

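Experimental Results

Research Questions

RQ1: Can a graph neural network effectively model the high-level spatial structure among body parts within a single skeleton frame?
RQ2: Compared with a standard RNN, does a stack of skip-clip LSTMs better capture the detailed temporal dynamics of long skeleton sequences?
RQ3: Does the proposed clip-based incremental loss improve training convergence speed and recognition accuracy?
RQ4: How much do the spatial reasoning and temporal stack learning components contribute to performance on the benchmark datasets, individually and in combination?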

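Key Findings

The proposed SR-TSL model reaches 84.8% accuracy on the NTU RGB+D cross-subject benchmark, outperforming previous state-of-the-art methods.
In the cross-view setting, SR-TSL reaches 92.4% accuracy, showing strong generalization across camera viewpoints.
Ablation experiments show that both the spatial reasoning network and the temporal stack learning network improve performance significantly, with the latter having the larger effect.
The clip-based incremental loss speeds up convergence and improves final accuracy, with the effect most visible early in training.
Performance saturates once the clip length $d \geq 6$ and the RGNN time step $T \geq 5$, indicating diminishing returns beyond these values.
The two-stream architecture, which processes position and velocity sequences together, outperforms either stream used alone.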
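
To make the idea of a clip-based incremental loss concrete, the sketch below weights each clip's cross-entropy by its position in the sequence, so later clips, which have seen more of the action, count more. The summary does not give the formula, so the linear m/M weighting and the function names cross_entropy and clip_incremental_loss are assumptions made for illustration only.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])


def clip_incremental_loss(clip_probs, label):
    """Sum of per-clip cross-entropy terms with increasing weights.

    clip_probs[m - 1] holds the class probabilities predicted after clip m.
    The weight m / M for clip m of M is an ASSUMED weighting, not taken
    from the summarized paper.
    """
    num_clips = len(clip_probs)
    return sum((m / num_clips) * cross_entropy(probs, label)
               for m, probs in enumerate(clip_probs, start=1))
```

Under this weighting, a wrong prediction on the final clip is penalized more heavily than the same wrong prediction on the first clip.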

This review was generated by AI and reviewed by a human editor.