[Paper Review] Part-based Graph Convolutional Network for Action Recognition
This paper introduces PB-GCN, a part-based graph convolutional network that partitions the human skeleton into body parts, uses geometric and kinematic node features, and achieves state-of-the-art results on NTURGB+D and HDM05 for skeletal action recognition.
Human actions comprise the joint motion of articulated body parts, or 'gestures'. The human skeleton is intuitively represented as a sparse graph with joints as nodes and the natural connections between them as edges. Graph convolutional networks have been used to recognize actions from skeletal videos. We introduce a part-based graph convolutional network (PB-GCN) for this task, inspired by Deformable Part-based Models (DPMs). We divide the skeleton graph into four subgraphs with joints shared across them and learn a recognition model using a part-based graph convolutional network. We show that such a model improves recognition performance compared to a model that uses the entire skeleton graph. Instead of using 3D joint coordinates as node features, we show that using relative coordinates and temporal displacements boosts performance. Our model achieves state-of-the-art performance on two challenging benchmark datasets, NTURGB+D and HDM05, for skeletal action recognition.
Motivation & Objective
- Motivate action recognition from skeletal data using a part-based viewpoint to capture part-specific and inter-part relations.
- Propose PB-GCN that partitions the skeleton graph into subgraphs with shared vertices and learns part-wise convolutions.
- Show that using geometric (relative coordinates) and motion (temporal displacements) features improves recognition over 3D joint coordinates.
- Demonstrate state-of-the-art performance on NTURGB+D and HDM05 datasets with the proposed framework.
Proposed method
- Define a general part-based graph convolutional network (PB-GCN) for graphs with known partition properties.
- Partition the skeleton graph into multiple overlapping subgraphs representing body parts (e.g., axial and appendicular components).
- Perform spatial convolutions independently on each part, then aggregate using a learned fusion function F_agg across parts.
- Extend to spatio-temporal graphs by connecting joints temporally within each part and across frames, followed by temporal convolution.
- Use relative coordinates and temporal displacements as node features, concatenated, instead of raw 3D joint coordinates.
- Incorporate a learnable edge weight mask and residual connections, following a ResNet-like architecture, with 9 SP-Temporal GCN units.
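The per-part convolution and aggregation steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 5-joint toy skeleton, the Kipf-Welling style normalized adjacency, and the weighted-sum form of F_agg are all assumptions made for illustration, and the names `normalized_adjacency` and `part_based_graph_conv` are hypothetical. The temporal convolution, edge weight mask, and residual connections are left out.

```python
import numpy as np

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph (assumed normalization)."""
    A = np.eye(num_nodes)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))

def part_based_graph_conv(X, parts, edges, weights, agg_weights):
    """Convolve each part separately, then fuse with a weighted sum.

    X:           (num_nodes, in_dim) node features
    parts:       list of node-index lists; parts may share joints
    edges:       list of (i, j) skeleton edges in global indices
    weights:     one (in_dim, out_dim) kernel per part
    agg_weights: one scalar per part (assumed weighted-sum F_agg)
    """
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for nodes, W, w in zip(parts, weights, agg_weights):
        local = {n: k for k, n in enumerate(nodes)}
        # Keep only the edges whose endpoints both lie inside this part.
        local_edges = [(local[i], local[j]) for i, j in edges
                       if i in local and j in local]
        A_hat = normalized_adjacency(len(nodes), local_edges)
        Y = A_hat @ X[nodes] @ W   # convolution restricted to this part
        out[nodes] += w * Y        # shared joints accumulate several parts
    return out

# Hypothetical toy skeleton: joint 0 is a torso joint shared by every part.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
parts = [[0, 1], [0, 2, 3], [0, 4]]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
weights = [rng.normal(size=(3, 4)) for _ in parts]
agg_weights = [0.5, 0.3, 0.2]

out = part_based_graph_conv(X, parts, edges, weights, agg_weights)
print(out.shape)
```

In this sketch a joint that belongs to one part only sees that part's kernel, while the shared joint 0 receives a contribution from every part, which is how information crosses part boundaries.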
Experimental results
Research questions
- RQ1: Can partitioning the skeleton graph into meaningful body parts improve action recognition over treating the skeleton as a single graph?
- RQ2: Do geometric (relative coordinates) and kinematic (temporal displacements) features improve skeletal action recognition when used with PB-GCN?
- RQ3: What is the impact of different part configurations (1, 2, 4, 6 parts) on recognition accuracy?
- RQ4: How does PB-GCN compare to state-of-the-art graph-based skeletal action recognition methods on NTURGB+D and HDM05 datasets?
Key findings
- PB-GCN with four parts achieves higher accuracy than single-part and other partition schemes on NTURGB+D.
- Using both relative coordinates and temporal displacements (D_R || D_T) yields the best performance among tested signals, especially with more parts.
- PB-GCN outperforms previous graph-based skeletal action recognition methods on NTURGB+D and HDM05, achieving state-of-the-art results.
- Geometric and kinematic cues provide significant gains, with temporal displacements contributing notably to performance.
- Shared or separate convolution kernels across parts can be configured; part-based aggregation via F_agg effectively fuses information from multiple parts.
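The D_R || D_T signal mentioned in the findings can be sketched as follows. This is an illustrative sketch rather than the paper's code: the choice of joint 0 as the reference (root) joint, the zero displacement assigned to the first frame, and the function name `relative_and_temporal_features` are assumptions.

```python
import numpy as np

def relative_and_temporal_features(joints, root=0):
    """Build D_R || D_T node features from raw 3D joint coordinates.

    joints: (T, V, 3) array with T frames and V joints.
    root:   index of the reference joint (assumed; not taken from the paper).
    Returns a (T, V, 6) array.
    """
    joints = np.asarray(joints, dtype=float)
    # D_R: position of each joint relative to the root joint in the same frame.
    d_rel = joints - joints[:, root:root + 1, :]
    # D_T: displacement of each joint from the previous frame.
    # The first frame has no predecessor, so it is set to zero (assumption).
    d_tmp = np.zeros_like(joints)
    d_tmp[1:] = joints[1:] - joints[:-1]
    # Concatenate along the channel axis to obtain D_R || D_T.
    return np.concatenate([d_rel, d_tmp], axis=-1)

# Hypothetical clip: 2 frames, 3 joints.
clip = np.arange(2 * 3 * 3).reshape(2, 3, 3)
feats = relative_and_temporal_features(clip)
print(feats.shape)
```

Both components are invariant to a global translation of the skeleton, which raw 3D coordinates are not.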
This review was created by AI and reviewed by human editors.