[Paper Interpretation] A Survey on 3D Skeleton-Based Action Recognition Using Learning Method
This survey comprehensively reviews deep learning approaches for 3D skeleton-based action recognition, covering RNNs, CNNs, GCNs, and Transformers, and compares state-of-the-art methods on NTU-RGB+D and NTU-RGB+D 120 datasets.
3D skeleton-based action recognition (3D SAR) has gained significant attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, many impressive works, based on both conventional handcrafted features and learned feature extraction methods, have appeared over the years. However, prior surveys on action recognition have primarily focused on video or RGB data-dominated approaches, with limited coverage of skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, no prior work has provided an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of 3D skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on four fundamental deep architectures, i.e., Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.
Research Motivation and Objectives
- Motivate the use of 3D skeleton data as a robust modality for action recognition.
- Systematically summarize deep learning architectures used for 3D SAR (RNNs, CNNs, GCNs, Transformers).
- Analyze data representations, spatial-temporal modeling, and co-occurrence features in skeleton-based methods.
- Provide benchmarks and insights on NTU-RGB+D and NTU-RGB+D 120 to guide future research.
Proposed Methods
- Introduce four fundamental DL architectures (RNNs, CNNs, GCNs, Transformers) and compare their properties in 3D SAR.
- Discuss data representations and preprocessing strategies for skeleton data (joint/bone graphs, skeleton images, co-occurrence features).
- Survey representative methods within each architecture, focusing on spatial-temporal modeling and attention mechanisms.
- Highlight graph-structured approaches (ST-GCN, 2s-AGCN, MS-G3D, etc.) and transformer-based variants (self-attention, decoupled attention) as core techniques.
- Present a data-driven analysis of datasets and performance trends on NTU-RGB+D and NTU-RGB+D 120.
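
To make the joint/bone representation and the graph-convolution idea in the list above more concrete, here is a minimal sketch in plain Python. It is not taken from the survey or from any of the cited methods: the 5-joint toy skeleton, the `EDGES` list, and the function names are hypothetical, and the layer uses the common symmetric normalization D^{-1/2}(A + I)D^{-1/2} with a fixed weight matrix instead of learned parameters.

```python
import math

# Hypothetical 5-joint toy skeleton (NOT the 25-joint NTU layout):
# 0 = torso, 1 = head, 2 = left hand, 3 = right hand, 4 = hip
EDGES = [(1, 0), (2, 0), (3, 0), (4, 0)]  # (child, parent) pairs


def bone_vectors(joints, edges):
    """Bone-stream input: each bone is child position minus parent position."""
    return {
        child: tuple(c - p for c, p in zip(joints[child], joints[parent]))
        for child, parent in edges
    }


def normalized_adjacency(num_joints, edges):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def graph_conv(features, adj, weight):
    """One spatial layer: aggregate neighbours (adj @ X), then project (@ W)."""
    n, c_in, c_out = len(features), len(weight), len(weight[0])
    agg = [[sum(adj[i][k] * features[k][c] for k in range(n))
            for c in range(c_in)] for i in range(n)]
    return [[sum(agg[i][c] * weight[c][o] for c in range(c_in))
             for o in range(c_out)] for i in range(n)]


# A single frame of 3D joint coordinates for the toy skeleton.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.5, 0.0),
          (1.0, 0.5, 0.0), (0.0, -1.0, 0.0)]
bones = bone_vectors(joints, EDGES)
adj = normalized_adjacency(len(joints), EDGES)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
out = graph_conv([list(j) for j in joints], adj, identity)
```

With an identity weight matrix, `out` is simply each joint's coordinates averaged with its neighbours under the normalized adjacency. Adaptive-topology methods, as described in the list above, would learn or refine `adj` rather than fixing it from the skeleton.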
Experimental Results
Research Questions
- RQ1: What are the main deep learning architectures used for 3D skeleton-based action recognition and how do they compare?
- RQ2: How do RNNs, CNNs, GCNs, and Transformers handle spatial-temporal modeling and skeleton data representations?
- RQ3: What are the current top-performing methods on NTU-RGB+D and NTU-RGB+D 120, and what architectures do they employ?
- RQ4: What future directions and challenges remain for 3D SAR with skeleton data?
Key Findings
| Dataset | Rank | Paper | Year | Accuracy % (C-View on NTU-RGB+D; C-Subject on NTU-RGB+D 120) | Accuracy % (C-Subject on NTU-RGB+D; C-Setup on NTU-RGB+D 120) | Method |
|---|---|---|---|---|---|---|
| NTU-RGB+D dataset | 1 | Wang et al. [109] | 2023 | 98.7 | 94.8 | Two-stream Transformer |
| NTU-RGB+D dataset | 2 | Duan et al. [23] | 2022 | n/a | 93.2 | Dynamic group GCN |
| NTU-RGB+D dataset | 3 | Liu et al. [68] | 2023 | 96.8 | 92.8 | Temporal decoupling GCN |
| NTU-RGB+D dataset | 4 | Zhou et al. [150] | 2022 | n/a | 92.9 | Transformer |
| NTU-RGB+D dataset | 5 | Chen et al. [14] | 2021 | 96.8 | 92.4 | Topology refinement GCN |
| NTU-RGB+D dataset | 6 | Zeng et al. [135] | 2021 | 96.7 | 91.6 | Skeletal GCN |
| NTU-RGB+D dataset | 7 | Liu et al. [74] | 2020 | 96.2 | 91.5 | Disentangling and unifying GCN |
| NTU-RGB+D dataset | 8 | Ye et al. [130] | 2020 | 96.0 | 91.5 | Dynamic GCN |
| NTU-RGB+D dataset | 9 | Shi et al. [87] | 2019 | 96.1 | 89.9 | Directed graph neural networks |
| NTU-RGB+D dataset | 10 | Shi et al. [88] | 2018 | 95.1 | 88.5 | Two-stream adaptive GCN |
| NTU-RGB+D dataset | 11 | Zhang et al. [140] | 2018 | 95.0 | 89.2 | LSTM based RNN |
| NTU-RGB+D dataset | 12 | Si et al. [91] | 2019 | 95.0 | 89.2 | AGC-LSTM(Joints&Part) |
| NTU-RGB+D dataset | 13 | Hu et al. [33] | 2018 | 94.9 | 89.1 | Non-local S-T + frequency attention |
| NTU-RGB+D dataset | 14 | Li et al. [51] | 2019 | 94.2 | 86.8 | GCN |
| NTU-RGB+D dataset | 15 | Liang et al. [57] | 2019 | 93.7 | 88.6 | 3S-CNN + multi-task ensemble learning |
| NTU-RGB+D dataset | 16 | Song et al. [94] | 2019 | 93.5 | 85.9 | Richly activated GCN |
| NTU-RGB+D dataset | 17 | Zhang et al. [141] | 2019 | 93.4 | 86.6 | Semantics-guided GCN |
| NTU-RGB+D dataset | 18 | Xie et al. [49] | 2018 | 93.2 | 82.7 | RNN+CNN+Attention |
| NTU-RGB+D 120 dataset | 1 | Wang et al. [109] | 2023 | 92.0 | 93.8 | Two-stream Transformer |
| NTU-RGB+D 120 dataset | 2 | Xu et al. [124] | 2023 | n/a | 91.8 | Language Knowledge-Assisted |
| NTU-RGB+D 120 dataset | 3 | Zhou et al. [150] | 2022 | 89.9 | 91.3 | Transformer |
| NTU-RGB+D 120 dataset | 4 | Duan et al. [23] | 2022 | 89.6 | 91.3 | Dynamic group GCN |
| NTU-RGB+D 120 dataset | 5 | Chen et al. [14] | 2021 | 88.9 | 90.6 | Topology refinement GCN |
| NTU-RGB+D 120 dataset | 6 | Chen et al. [13] | 2021 | 88.2 | 89.3 | Spatial-Temporal GCN |
| NTU-RGB+D 120 dataset | 7 | Liu et al. [74] | 2020 | 86.9 | 88.4 | Disentangling and unifying GCN |
| NTU-RGB+D 120 dataset | 8 | Cheng et al. [16] | 2020 | 85.9 | 87.6 | Shift GCN |
| NTU-RGB+D 120 dataset | 9 | Caetano et al. [6] | 2019 | 67.9 | 62.8 | Tree Structure + CNN |
| NTU-RGB+D 120 dataset | 10 | Caetano et al. [7] | 2019 | 67.7 | 66.9 | SkeleMotion |
| NTU-RGB+D 120 dataset | 11 | Liu et al. [69] | 2018 | 64.6 | 66.9 | Body Pose Evolution Map |
| NTU-RGB+D 120 dataset | 12 | Ke et al. [40] | 2018 | 62.2 | 61.8 | Multi-Task CNN with RotClips |
| NTU-RGB+D 120 dataset | 13 | Liu et al. [64] | 2017 | 61.2 | 63.3 | Two-Stream Attention LSTM |
| NTU-RGB+D 120 dataset | 14 | Liu et al. [71] | 2017 | 60.3 | 63.2 | Skeleton Visualization (Single Stream) |
| NTU-RGB+D 120 dataset | 15 | Jun et al. [67] | 2019 | 59.9 | 62.4 | Online+Dilated CNN |
| NTU-RGB+D 120 dataset | 16 | Ke et al. [39] | 2017 | 58.4 | 57.9 | Multi-Task Learning CNN |
| NTU-RGB+D 120 dataset | 17 | Jun et al. [65] | 2017 | 58.3 | 59.2 | Global Context-Aware Attention LSTM |
| NTU-RGB+D 120 dataset | 18 | Jun et al. [63] | 2016 | 55.7 | 57.9 | Spatio-Temporal LSTM |
- GCN-based methods account for most of the top-ranked entries on NTU-RGB+D and NTU-RGB+D 120, although the top entry on both benchmarks in the table above is a two-stream Transformer.
- Transformer-based methods show strong potential and are increasingly combined with GCNs or CNNs in hybrid models.
- NTU-RGB+D 120 is the harder benchmark: its best reported accuracies (92.0 and 93.8) are below the best on NTU-RGB+D (98.7 and 94.8), indicating room for further advancement across architectures.
- Representations that capture joint–bone structure and spatial-temporal graphs, along with adaptive topologies, contribute to performance gains.
- Datasets and evaluation protocols (Cross-Subject, Cross-View, Cross-Setup) are crucial for fair comparisons of 3D SAR models.
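
As a rough illustration of how the evaluation protocols in the last point partition the data, the sketch below assigns a sample to train or test from its file name. It assumes the NTU naming pattern `SsssCcccPpppRrrrAaaa` (setup, camera, performer, replication, action) and the commonly used split rules (Cross-View trains on cameras 2 and 3; Cross-Setup trains on even setup IDs). Neither is spelled out in this summary, so both should be checked against the official dataset documentation. The training-subject list for Cross-Subject is not given here either, so it is left as a caller-supplied parameter. `parse_name` and `split_of` are hypothetical helper names.

```python
import re

# Assumed NTU sample-name pattern: SsssCcccPpppRrrrAaaa
# (setup, camera, performer/subject, replication, action class).
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")


def parse_name(name):
    """Extract the five integer IDs from an NTU-style sample name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"not an NTU-style sample name: {name!r}")
    keys = ("setup", "camera", "subject", "replication", "action")
    return dict(zip(keys, (int(g) for g in match.groups())))


def split_of(name, protocol, train_subjects=frozenset()):
    """Return 'train' or 'test' for one sample under the given protocol."""
    info = parse_name(name)
    if protocol == "cross_view":
        # Assumption: cameras 2 and 3 train, camera 1 tests.
        is_train = info["camera"] in (2, 3)
    elif protocol == "cross_setup":
        # Assumption: even setup IDs train, odd setup IDs test.
        is_train = info["setup"] % 2 == 0
    elif protocol == "cross_subject":
        # The official training-subject IDs must be supplied by the caller.
        is_train = info["subject"] in train_subjects
    else:
        raise ValueError(f"unknown protocol: {protocol!r}")
    return "train" if is_train else "test"
```

For example, `split_of("S001C001P001R001A001", "cross_view")` returns `"test"` under the stated camera assumption. Because the three protocols split on different IDs, the same sample can land in the training set under one protocol and the test set under another, so accuracies are only comparable within a single protocol.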
This interpretation was generated by AI and reviewed by a human editor.