QUICK REVIEW

[Paper Review] Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Zhan Chen, Sicheng Li|arXiv (Cornell University)|Jun 27, 2022

Human Pose and Action Recognition | Cited by 31

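One-Sentence Summary

The paper proposes MST-GCN, which uses multi-scale spatial and temporal graph convolutions (MS-GC and MT-GC) to enlarge the receptive field and capture both short- and long-range spatio-temporal dependencies, improving skeleton-based action recognition. It outperforms its baselines on NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton with a comparable parameter count.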

ABSTRACT

Graph convolutional networks have been widely used for skeleton-based action recognition due to their excellent modeling ability of non-Euclidean data. As the graph convolution is a local operation, it can only utilize the short-range joint dependencies and short-term trajectory but fails to directly model the distant joints relations and long-range temporal information that are vital to distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enrich the receptive field of the model in spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolution, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-graph convolutions, and each node could complete multiple spatial and temporal aggregations with its neighborhoods. The final equivalent receptive field is accordingly enlarged, which is capable of capturing both short- and long-range dependencies in spatial and temporal domains. By coupling these two modules as a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets, NTU RGB+D, NTU-120 RGB+D and Kinetics-Skeleton, for skeleton-based action recognition.

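Motivation and Objectives

Explain why skeleton-based action recognition must account for both short- and long-range spatial dependencies and temporal dynamics.
Introduce multi-scale spatial and temporal graph convolution modules that enlarge the receptive field without adding parameters.
Combine MS-GC and MT-GC into an MST-GCN block and stack these blocks for end-to-end learning of motion representations.
Demonstrate effectiveness on the NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton benchmarks.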
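
The "no additional parameters" goal can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions, not a count taken from the paper: it assumes the channels are split into equal fragments, the first fragment is passed through unchanged, and every other fragment gets its own small feature transform. Biases and any surrounding layers are ignored, and both function names are hypothetical.

```python
def single_scale_params(channels):
    """Weights of one dense channels -> channels feature transform (bias ignored)."""
    return channels * channels


def multi_scale_params(channels, splits):
    """Weights when the channels are split into `splits` equal fragments.

    Assumption: the first fragment is passed through unchanged and each of the
    remaining fragments has its own (channels/splits) -> (channels/splits) transform.
    """
    if channels % splits != 0:
        raise ValueError("channels must be divisible by splits")
    width = channels // splits
    return (splits - 1) * width * width


dense = single_scale_params(64)
split4 = multi_scale_params(64, 4)
print(dense, split4)
```

Under these assumptions the split transforms never need more weights than the dense one, which is consistent with the claim that the larger receptive field comes from the cascade rather than from extra parameters.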

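Proposed Method

Model the skeleton as a spatio-temporal graph in which joints are nodes and bones and temporal connections are edges.
Replace the conventional single-scale graph convolution with MS-GC, which cascades sub-graph convolutions in a hierarchical residual layout to enlarge the spatial receptive field.
Extend MS-GC to the temporal domain as MT-GC, which uses the same hierarchical residual scheme with multi-scale temporal aggregation to capture long-range temporal dynamics.
Combine MS-GC and MT-GC into an MST-GCN block and stack these blocks to form the full MST-GCN network; an optional STR-GC variant connects the spatial and temporal sub-modules within a single block.
Provide two implementation variants: (a) MS-GC + MT-GC as a replacement for the ST-GCN block, and (b) a spatial temporal residual GC (STR-GC) that alternates spatial and temporal updates within a block.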
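
The hierarchical residual cascade behind MS-GC can be illustrated on a toy graph. What follows is a minimal sketch, not the authors' implementation: it assumes a Res2Net-style rule (y_1 = x_1, y_i = aggregate(x_i + y_{i-1}) for i > 1), replaces the learned sub-graph convolutions with a plain mean over 1-hop neighbors, and runs on a hypothetical 5-joint chain. All names in the code are made up for illustration.

```python
# Toy skeleton: 5 joints in a chain 0-1-2-3-4 (hypothetical, not an NTU layout).
NUM_JOINTS = 5
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def neighbors_with_self(num_joints, edges):
    """Return, for each joint, the set of its 1-hop neighbors plus itself."""
    nbrs = [{j} for j in range(num_joints)]
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def aggregate(channel, nbrs):
    """One local graph aggregation: mean over each joint's 1-hop neighborhood."""
    return [sum(channel[k] for k in nbrs[j]) / len(nbrs[j]) for j in range(len(channel))]


def ms_gc(fragments, nbrs):
    """Hierarchical residual cascade over the channel fragments.

    Assumed rule: y_1 = x_1, y_i = aggregate(x_i + y_{i-1}) for i > 1.
    Learnable weights are omitted to keep the sketch short.
    """
    outputs = [fragments[0]]
    for x in fragments[1:]:
        mixed = [a + b for a, b in zip(x, outputs[-1])]
        outputs.append(aggregate(mixed, nbrs))
    return outputs


# Put an impulse on joint 0 of the first fragment; all other inputs are zero.
fragments = [[1.0, 0.0, 0.0, 0.0, 0.0]] + [[0.0] * NUM_JOINTS for _ in range(3)]
outputs = ms_gc(fragments, neighbors_with_self(NUM_JOINTS, EDGES))

# Count how many joints the impulse has reached in each fragment's output.
reach = [sum(1 for v in y if v > 0) for y in outputs]
print(reach)
```

In this toy setting the impulse reaches one more hop in each later fragment, so concatenating the fragment outputs mixes short-range and longer-range joint information in one layer.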

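Experimental Results

Research Questions

RQ1: Can multi-scale spatial graph convolution capture relations between distant joints beyond the local neighborhood in the skeleton?
RQ2: Can multi-scale temporal graph convolution enlarge the temporal receptive field enough to model long-range dynamics effectively?
RQ3: Are the MS-GC and MT-GC modules complementary, improving action recognition over the ST-GCN baseline?
RQ4: Does MST-GCN generalize across datasets and reach state-of-the-art results on NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton?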

Key Findings

Method	X-View (%)	X-Sub (%)
HBRNN	64.0	59.1
P-LSTM	67.3	60.7
TCN	83.1	74.3
VA-LSTM	87.7	79.2
ST-GCN	88.3	81.5
AS-GCN	94.2	86.8
2s AGC-LSTM	95.0	89.2
2s AGCN	95.1	88.5
2s NAS-GCN	95.7	89.4
4s DGNN	96.1	89.9
4s MS-AAGCN	96.2	90.0
2s MS-G3D	96.2	91.5
4s Shift-GCN	96.5	90.7
Js MST-GCN (ours)	95.1	89.0
Bs MST-GCN (ours)	95.2	89.5
2s MST-GCN (ours)	96.4	91.1
4s MST-GCN (ours)	96.6	91.5
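
The row prefixes in the table are not defined on this page. A common convention in this line of work, assumed here, is that "Js" and "Bs" denote single joint-stream and bone-stream models, while "2s" and "4s" denote late fusion of two or four streams. The sketch below shows one way such inputs and fusion are often built: bone vectors as the difference between a joint and its parent, and fusion as a sum of per-class scores. The skeleton, the parent list, and the scores are made up for illustration and are not taken from the paper.

```python
# Hypothetical 4-joint skeleton in 2D: (x, y) per joint.
joints = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]
# Hypothetical parent of each joint; the root (joint 0) is its own parent.
parents = [0, 0, 1, 2]


def bone_stream(joints, parents):
    """Bone vector per joint: joint coordinate minus its parent's coordinate."""
    return [(jx - joints[p][0], jy - joints[p][1]) for (jx, jy), p in zip(joints, parents)]


def fuse_scores(per_stream_scores):
    """Late fusion: sum class scores across streams and return the argmax class."""
    num_classes = len(per_stream_scores[0])
    total = [sum(scores[c] for scores in per_stream_scores) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: total[c])


bones = bone_stream(joints, parents)

# Made-up class scores for 3 classes from a joint model and a bone model.
joint_scores = [0.5, 0.4, 0.1]
bone_scores = [0.2, 0.7, 0.1]
predicted = fuse_scores([joint_scores, bone_scores])
print(bones, predicted)
```

In this made-up case the joint model alone would pick class 0, while the fused scores pick class 1, which shows how a second stream can change the final prediction.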

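MS-GC improves spatial feature representation by capturing dependencies among both nearby and distant joints, and the gain grows as the number of splits s increases.
MT-GC enlarges the temporal receptive field and, at larger s, yields consistent accuracy gains over ST-GCN.
MS-GC and MT-GC are complementary: the full MST-GCN combination achieves higher accuracy than either module alone, with a clear gain at a similar parameter budget.
On NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton, MST-GCN achieves competitive or state-of-the-art Top-1 accuracy (and Top-5 accuracy where reported).
Compared with the ST-GCN baseline, MST-GCN gains about 1.8 percentage points with a similar parameter count and about 0.9 percentage points with roughly one third fewer parameters (ablation results).
Visualizations show that MST-GCN focuses on action-relevant joints and captures long-range dependencies (for example, whole-body coordination during walking).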
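
The link between the split number s and the temporal receptive field can be made concrete with a short calculation. This is a sketch under assumptions rather than the paper's exact configuration: it assumes the first fragment is passed through unchanged, each later fragment applies a stride-1, dilation-1 temporal convolution to the sum of its input and the previous fragment's output, and the kernel size of 3 frames is a hypothetical choice. The function name is made up for illustration.

```python
def temporal_receptive_fields(splits, kernel_size):
    """Frames seen by each fragment's output in a hierarchical residual cascade.

    Assumption: y_1 = x_1 (no convolution) and y_i = conv(x_i + y_{i-1}) with a
    stride-1, dilation-1 temporal kernel of `kernel_size` frames.
    """
    fields = [1]
    for _ in range(splits - 1):
        # Each extra convolution widens the window by kernel_size - 1 frames.
        fields.append(fields[-1] + kernel_size - 1)
    return fields


fields = temporal_receptive_fields(splits=4, kernel_size=3)
print(fields)
```

Under these assumptions the last fragment sees a wider window every time s grows, while the earlier fragments keep their short windows, so one block covers several temporal scales at once.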

This review was generated by AI and reviewed by a human editor.