QUICK REVIEW

[Paper Review] Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Zhan Chen, Sicheng Li|arXiv (Cornell University)|Jun 27, 2022

Human Pose and Action Recognition|31 citations

TL;DR

Proposes MST-GCN with multi-scale spatial (MS-GC) and temporal (MT-GC) graph convolutions, enabling large receptive fields to capture short- and long-range spatial-temporal dependencies for improved skeleton-based action recognition. Outperforms baselines on NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton with comparable parameters.

ABSTRACT

Graph convolutional networks have been widely used for skeleton-based action recognition due to their excellent modeling ability of non-Euclidean data. As the graph convolution is a local operation, it can only utilize the short-range joint dependencies and short-term trajectory but fails to directly model the distant joints relations and long-range temporal information that are vital to distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enrich the receptive field of the model in spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolution, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-graph convolutions, and each node could complete multiple spatial and temporal aggregations with its neighborhoods. The final equivalent receptive field is accordingly enlarged, which is capable of capturing both short- and long-range dependencies in spatial and temporal domains. By coupling these two modules as a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets, NTU RGB+D, NTU-120 RGB+D and Kinetics-Skeleton, for skeleton-based action recognition.

Motivation & Objective

Argue that skeleton-based action recognition requires modeling both short- and long-range spatial dependencies as well as short- and long-term temporal dynamics.
Introduce multi-scale spatial and temporal graph convolution modules to enlarge the receptive field without adding parameters.
Couple MS-GC and MT-GC into MST-GCN blocks and stack them for end-to-end learning of motion representations.
Demonstrate effectiveness on the NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton datasets.

Proposed method

Define the skeleton sequence as a spatio-temporal graph with joints as nodes and skeletal (spatial) and temporal connections as edges.
Replace the traditional single-scale graph convolution with MS-GC, which cascades sub-graph convolutions in a hierarchical residual layout to enlarge the spatial receptive field.
Extend the same idea to the temporal domain as MT-GC, which uses a hierarchical residual cascade of temporal sub-convolutions to aggregate at multiple time scales and capture long-range temporal dynamics.
Combine MS-GC and MT-GC into MST-GCN blocks and stack the blocks to form the full MST-GCN network.
Provide two block variants: (a) MS-GC + MT-GC in place of ST-GCN blocks, and (b) a Spatial-Temporal Residual GC (STR-GC) that concatenates spatial and temporal sub-modules within a block and updates them alternately.
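
The hierarchical residual layout described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Res2Net-style cascade over channel groups, omits all learned weights, and uses a hypothetical `aggregate` helper that only averages features over 1-hop neighbors. The function names, the 5-joint chain graph, and the impulse input are all illustrative assumptions.

```python
# Sketch only: hierarchical residual cascade over channel groups.
# Assumptions: no learned weights; `aggregate` is a hypothetical stand-in
# for one sub-graph convolution (mean over a node and its 1-hop neighbors).

def aggregate(adj, feats):
    """Average each node's features over itself and its 1-hop neighbors."""
    n = len(adj)
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or i == j]
        width = len(feats[i])
        out.append([sum(feats[j][c] for j in nbrs) / len(nbrs)
                    for c in range(width)])
    return out

def multi_scale_gc(adj, feats, s):
    """Split channels into s groups; group k also receives group k-1's output."""
    channels = len(feats[0])
    assert channels % s == 0, "channels must divide evenly into s groups"
    w = channels // s
    groups = [[row[k * w:(k + 1) * w] for row in feats] for k in range(s)]
    outputs, prev = [], None
    for g in groups:
        if prev is not None:
            # Residual link: add the previous group's output before aggregating.
            g = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g, prev)]
        prev = aggregate(adj, g)
        outputs.append(prev)
    # Concatenate the group outputs back along the channel axis.
    return [sum((outputs[k][i] for k in range(s)), [])
            for i in range(len(feats))]

# Hypothetical 5-joint chain 0-1-2-3-4 with an impulse at joint 0.
adj = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
feats = [[1.0, 1.0, 1.0]] + [[0.0, 0.0, 0.0] for _ in range(4)]
out = multi_scale_gc(adj, feats, s=3)
for joint, row in enumerate(out):
    print(joint, [round(v, 3) for v in row])
```

In this toy setting, the first channel group only reaches joint 1, the second reaches joint 2, and the third reaches joint 3, so later groups see a progressively larger neighborhood even though every step is a 1-hop aggregation. That is the sense in which the equivalent receptive field grows with the number of splits s.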

Experimental results

Research questions

RQ1: Can multi-scale spatial graph convolutions capture distant joint relationships beyond local neighborhoods in skeletons?
RQ2: Can multi-scale temporal graph convolutions enlarge the temporal receptive field to model long-range dynamics effectively?
RQ3: Do MS-GC and MT-GC modules complement each other to improve action recognition performance over ST-GCN baselines?
RQ4: Does MST-GCN transfer across datasets and achieve state-of-the-art results on NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton?

Key findings

MS-GC improves spatial feature representations by capturing both local and distant joint dependencies, with performance gains increasing as the number of splits s increases.
MT-GC expands temporal receptive fields and, with higher s, yields consistent accuracy gains relative to ST-GCN.
MS-GC and MT-GC are complementary; the full MST-GCN combination achieves higher accuracy than either module alone, with notable gains at comparable parameter budgets.
On NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton, MST-GCN achieves competitive or state-of-the-art Top-1 (and Top-5 where reported) accuracies across multiple benchmarks.
In the ablation results, MST-GCN improves on the ST-GCN baseline by up to about 1.8 percentage points with a similar parameter count, and by up to 0.9 percentage points with roughly one-third fewer parameters.
Visualizations show MST-GCN focuses on action-relevant joints and can capture long-range dependencies (e.g., whole-body coordination during walking).
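
The temporal receptive-field growth behind the MT-GC finding can be checked with simple arithmetic. The sketch below assumes stride-1 temporal filters with an odd kernel size and replaces the learned temporal convolution with a plain moving average; the kernel sizes and stage counts are illustrative, not values taken from the paper. Cascading k such filters of size K covers 1 + k(K - 1) frames.

```python
# Sketch only: receptive field of cascaded stride-1 temporal filters.
# Assumption: a moving average stands in for a learned temporal convolution.

def temporal_avg(seq, kernel):
    """Same-length moving average with zero padding (odd kernel assumed)."""
    half = kernel // 2
    n = len(seq)
    return [sum(seq[t + d] for d in range(-half, half + 1) if 0 <= t + d < n) / kernel
            for t in range(n)]

def receptive_field(kernel, stages):
    """Closed form for `stages` cascaded stride-1 filters of size `kernel`."""
    return 1 + stages * (kernel - 1)

def measured_span(kernel, stages, length=41):
    """Count the frames reached by a single-frame impulse after the cascade."""
    seq = [0.0] * length
    seq[length // 2] = 1.0
    for _ in range(stages):
        seq = temporal_avg(seq, kernel)
    return sum(1 for v in seq if v > 0)

for stages in range(1, 5):
    print(stages, receptive_field(3, stages), measured_span(3, stages))
```

The measured span matches the closed form at every stage, which is why a later group in a hierarchical cascade sees a longer temporal window than a single filter of the same size.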

This review was created by AI and reviewed by human editors.