QUICK REVIEW

[Paper Review] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Sijie Yan, Yuanjun Xiong|arXiv (Cornell University)|Jan 23, 2018
Human Pose and Action Recognition | References: 43 | Cited by: 596
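One-Sentence Summary

ST-GCN learns spatial-temporal graph convolutions over skeleton sequences to recognize actions, achieving state-of-the-art results on Kinetics and NTU-RGB+D compared with methods based on hand-crafted parts.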

ABSTRACT

Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.

Motivation and Objectives

  • Motivate skeleton-based action recognition as a modality with strong robustness to illumination and scene variation.
  • Develop a generic graph-based model that automatically learns spatial and temporal patterns from data rather than relying on hand-crafted parts.
  • Propose a Spatial-Temporal Graph Convolutional Network (ST-GCN) to operate on a skeleton graph sequence.
  • Investigate partitioning strategies and edge-weight learning to improve modeling of body parts and dynamics.
  • Demonstrate superior performance on large-scale datasets compared to prior methods.

Proposed Method

  • Represent skeleton sequences as spatial-temporal graphs with joints as nodes and intra-frame plus inter-frame edges.
  • Apply spatial-temporal graph convolution with partitioned neighbor sets to model local joint interactions and temporal dynamics.
  • Stack multiple ST-GCN layers, each sharing its weights across all joints, followed by global pooling and a SoftMax classifier.
  • Introduce partitioning strategies (uni-labeling, distance, spatial configuration) that determine how convolution weights are shared among a joint's neighbors.
  • Incorporate a learnable edge importance mask to weight the contribution of different joints/edges.
  • Train end-to-end with SGD; use data augmentation (random moving) and random fragment sampling on Kinetics.
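
The spatial part of the method above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 5-joint skeleton, the names `spatial_partition`, `normalize_rows`, and `spatial_graph_conv`, and all shapes are hypothetical. Two simplifications are assumed: hop distance to a chosen center joint stands in for the distance to the skeleton's gravity center, and each neighbor subset is row-normalized. The temporal convolution, pooling, and classifier are left out.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (not a real dataset layout):
# 0 = torso (center), 1 = neck, 2 = head, 3 = shoulder, 4 = elbow
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]
NUM_JOINTS = 5
CENTER = 0


def hop_distance(num_joints, edges, source):
    """Breadth-first hop count from `source` to every joint."""
    neighbors = {i: [] for i in range(num_joints)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [-1] * num_joints
    dist[source] = 0
    frontier = [source]
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return np.array(dist)


def spatial_partition(num_joints, edges, center):
    """Split the 1-hop adjacency (with self-loops) into three subsets:
    0 = root, 1 = centripetal (neighbor closer to the center),
    2 = centrifugal (neighbor farther from the center)."""
    adj = np.eye(num_joints)
    for a, b in edges:
        adj[a, b] = adj[b, a] = 1.0
    r = hop_distance(num_joints, edges, center)  # assumption: hops, not metric distance
    parts = np.zeros((3, num_joints, num_joints))
    for i in range(num_joints):        # root joint
        for j in range(num_joints):    # candidate neighbor
            if adj[i, j] == 0:
                continue
            if r[j] == r[i]:
                parts[0, i, j] = 1.0
            elif r[j] < r[i]:
                parts[1, i, j] = 1.0
            else:
                parts[2, i, j] = 1.0
    return parts


def normalize_rows(parts, eps=1e-3):
    """Row-normalize each subset; eps guards rows that are empty."""
    return parts / (parts.sum(axis=2, keepdims=True) + eps)


def spatial_graph_conv(x, parts, weights, mask):
    """x: (T, V, C_in); parts, mask: (K, V, V); weights: (K, C_in, C_out).
    Each subset k has its own weight matrix; mask rescales every edge."""
    return np.einsum("kvw,twc,kcd->tvd", parts * mask, x, weights)


rng = np.random.default_rng(0)
T, C_IN, C_OUT = 4, 3, 8
parts = normalize_rows(spatial_partition(NUM_JOINTS, EDGES, CENTER))
weights = rng.normal(size=(parts.shape[0], C_IN, C_OUT))
mask = np.ones_like(parts)                   # would be learned; set to ones here
x = rng.normal(size=(T, NUM_JOINTS, C_IN))   # random stand-in for joint coordinates
out = spatial_graph_conv(x, parts, weights, mask)
print(out.shape)  # (4, 5, 8)
```

Using a single subset that contains every neighbor would correspond to uni-labeling, where all neighbors share one weight matrix. The three-subset split lets the layer treat motion toward and away from the body center differently.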

Experimental Results

Research Questions

  • RQ1: Can ST-GCN outperform hand-crafted-part skeleton methods by learning spatial-temporal patterns directly from data?
  • RQ2: How do different neighbor partitioning strategies affect action recognition performance?
  • RQ3: Does incorporating learnable edge importance weighting improve accuracy?
  • RQ4: Does the ST-GCN approach generalize across datasets with varying joint counts and graph structures (2D OpenPose vs. 3D Kinect data)?

Key Findings

  • On Kinetics, ST-GCN with spatial configuration partitioning and edge weighting achieves 30.7% Top-1 and 52.8% Top-5 accuracy, outperforming baselines and prior skeleton-based methods.
  • Partitioning strategies with multiple subsets outperform uni-labeling, with spatial configuration providing the best gains.
  • Adding learnable edge importance weighting yields a further improvement of up to about one percentage point in Top-1 and Top-5 accuracy.
  • On NTU-RGB+D, ST-GCN achieves 81.5% (X-Sub) and 88.3% (X-View) top-1 accuracy, surpassing prior state-of-the-art methods on constrained data.
  • ST-GCN substantially outperforms earlier skeleton-based methods, including those built on hand-crafted features, on both the unconstrained (Kinetics) and constrained (NTU-RGB+D) datasets; on Kinetics it still trails frame-based RGB and optical-flow models.
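
The findings above are stated as Top-1 and Top-5 accuracy. As a reminder of what those metrics measure, here is a minimal sketch of one common way to compute Top-k accuracy. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not data or code from the paper.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.
    scores: (N, num_classes); labels: (N,) integer class ids."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)    # true label found among them?
    return float(hits.mean())


# Made-up classifier scores for 4 clips and 3 classes
scores = np.array([
    [0.10, 0.60, 0.30],   # highest score: class 1
    [0.50, 0.20, 0.30],   # highest score: class 0
    [0.20, 0.30, 0.50],   # highest score: class 2
    [0.40, 0.35, 0.25],   # highest score: class 0
])
labels = np.array([1, 2, 2, 1])

print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 2))  # 1.0
```

With hundreds of action classes, as in Kinetics, Top-5 credits a prediction whenever the correct class appears among the five highest-scoring candidates, which is why it is always at least as high as Top-1.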


This review was generated by AI and reviewed by human editors.