QUICK REVIEW

[Paper Review] A Survey on 3D Skeleton-Based Action Recognition Using Learning Method

Bin Ren, Mengyuan Liu|arXiv (Cornell University)|Feb 14, 2020

Human Pose and Action Recognition|75 references|86 citations

TL;DR

This survey reviews deep learning approaches to 3D skeleton-based action recognition, covering RNNs, CNNs, GCNs, and Transformers, and compares state-of-the-art methods on the NTU-RGB+D and NTU-RGB+D 120 datasets.

ABSTRACT

3D skeleton-based action recognition (3D SAR) has gained significant attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and learned feature extraction methods, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video or RGB data-dominated approaches, with limited coverage of reviews related to skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of 3D skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on four fundamental deep architectures, i.e., Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Graph Convolutional Network (GCN), and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.

Motivation & Objective

Motivate the use of 3D skeleton data as a robust modality for action recognition.
Systematically summarize deep learning architectures used for 3D SAR (RNNs, CNNs, GCNs, Transformers).
Analyze data representations, spatial-temporal modeling, and co-occurrence features in skeleton-based methods.
Provide benchmarks and insights on NTU-RGB+D and NTU-RGB+D 120 to guide future research.

Proposed method

Introduce four fundamental DL architectures (RNNs, CNNs, GCNs, Transformers) and compare their properties in 3D SAR.
Discuss data representations and preprocessing strategies for skeleton data (joint/bone graphs, skeleton images, co-occurrence features).
Survey representative methods within each architecture, focusing on spatial-temporal modeling and attention mechanisms.
Highlight graph-structured approaches (ST-GCN, 2s-AGCN, MS-G3D, etc.) and transformer-based variants (self-attention, decoupled attention) as core techniques.
Present a data-driven analysis of datasets and performance trends on NTU-RGB+D and NTU-RGB+D 120.
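
The joint/bone representation and graph-structured modeling mentioned above can be illustrated with a minimal NumPy sketch. It is not the formulation of any specific surveyed method: the 5-joint toy skeleton, the function names, and the single symmetric-normalized adjacency are all assumptions made for illustration (NTU-RGB+D skeletons have 25 joints, and published GCN models use richer partitioning or learned topologies).

```python
import numpy as np

# Hypothetical 5-joint toy skeleton given as (child, parent) pairs.
# This small tree is an assumption for illustration only.
EDGES = [(1, 0), (2, 1), (3, 1), (4, 1)]
NUM_JOINTS = 5


def bone_features(joints, edges=EDGES):
    """Bone vectors: child joint minus parent joint, stored at the child index.

    joints has shape (T, V, C): frames, joints, coordinates.
    The root joint has no parent, so its bone vector stays zero.
    """
    bones = np.zeros_like(joints)
    for child, parent in edges:
        bones[:, child] = joints[:, child] - joints[:, parent]
    return bones


def normalized_adjacency(num_joints, edges=EDGES):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of the skeleton graph."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def spatial_graph_conv(x, a_hat, weight):
    """One spatial graph convolution applied to every frame: A_hat @ X @ W.

    x: (T, V, C_in), a_hat: (V, V), weight: (C_in, C_out) -> (T, V, C_out).
    """
    return np.einsum("uv,tvc,cd->tud", a_hat, x, weight)


rng = np.random.default_rng(0)
joints = rng.normal(size=(4, NUM_JOINTS, 3))   # 4 frames, 5 joints, xyz
bones = bone_features(joints)                  # second input stream
a_hat = normalized_adjacency(NUM_JOINTS)
out = spatial_graph_conv(joints, a_hat, rng.normal(size=(3, 8)))
print(out.shape)  # (4, 5, 8)
```

In a two-stream setup, the same kind of layer would typically be applied to `joints` and `bones` separately, with the class scores of the two streams fused at the end.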

Experimental results

Research questions

RQ1: What are the main deep learning architectures used for 3D skeleton-based action recognition and how do they compare?
RQ2: How do RNNs, CNNs, GCNs, and Transformers handle spatial-temporal modeling and skeleton data representations?
RQ3: What are the current top-performing methods on NTU-RGB+D and NTU-RGB+D 120, and what architectures do they employ?
RQ4: What future directions and challenges remain for 3D SAR with skeleton data?

Key findings

Dataset	Rank	Paper	Year	C-View accuracy on NTU-RGB+D (%)	C-Subject accuracy on NTU-RGB+D (%)	Method
NTU-RGB+D dataset	1	Wang et al. [109]	2023	98.7	94.8	Two-stream Transformer
NTU-RGB+D dataset	2	Duan et al. [23]	2022	n/a	93.2	Dynamic group GCN
NTU-RGB+D dataset	3	Liu et al. [68]	2023	96.8	92.8	Temporal decoupling GCN
NTU-RGB+D dataset	4	Zhou et al. [150]	2022	n/a	92.9	Transformer
NTU-RGB+D dataset	5	Chen et al. [14]	2021	96.8	92.4	Topology refinement GCN
NTU-RGB+D dataset	6	Zeng et al. [135]	2021	96.7	91.6	Skeletal GCN
NTU-RGB+D dataset	7	Liu et al. [74]	2020	96.2	91.5	Disentangling and unifying GCN
NTU-RGB+D dataset	8	Ye et al. [130]	2020	96.0	91.5	Dynamic GCN
NTU-RGB+D dataset	9	Shi et al. [87]	2019	96.1	89.9	Directed graph neural networks
NTU-RGB+D dataset	10	Shi et al. [88]	2018	95.1	88.5	Two-stream adaptive GCN
NTU-RGB+D dataset	11	Zhang et al. [140]	2018	95.0	89.2	LSTM based RNN
NTU-RGB+D dataset	12	Si et al. [91]	2019	95.0	89.2	AGC-LSTM(Joints&Part)
NTU-RGB+D dataset	13	Hu et al. [33]	2018	94.9	89.1	Non-local S-T + frequency attention
NTU-RGB+D dataset	14	Li et al. [51]	2019	94.2	86.8	GCN
NTU-RGB+D dataset	15	Liang et al. [57]	2019	93.7	88.6	3S-CNN + multi-task ensemble learning
NTU-RGB+D dataset	16	Song et al. [94]	2019	93.5	85.9	Richly activated GCN
NTU-RGB+D dataset	17	Zhang et al. [141]	2019	93.4	86.6	Semantics-guided GCN
NTU-RGB+D dataset	18	Xie et al. [49]	2018	93.2	82.7	RNN+CNN+Attention
Dataset	Rank	Paper	Year	C-Subject accuracy on NTU-RGB+D 120 (%)	C-Setup accuracy on NTU-RGB+D 120 (%)	Method
NTU-RGB+D 120 dataset	1	Wang et al. [109]	2023	92.0	93.8	Two-stream Transformer
NTU-RGB+D 120 dataset	2	Xu et al. [124]	2023	n/a	91.8	Language Knowledge-Assisted
NTU-RGB+D 120 dataset	3	Zhou et al. [150]	2022	89.9	91.3	Transformer
NTU-RGB+D 120 dataset	4	Duan et al. [23]	2022	89.6	91.3	Dynamic group GCN
NTU-RGB+D 120 dataset	5	Chen et al. [14]	2021	88.9	90.6	Topology refinement GCN
NTU-RGB+D 120 dataset	6	Chen et al. [13]	2021	88.2	89.3	Spatial-Temporal GCN
NTU-RGB+D 120 dataset	7	Liu et al. [74]	2020	86.9	88.4	Disentangling and unifying GCN
NTU-RGB+D 120 dataset	8	Cheng et al. [16]	2020	85.9	87.6	Shift GCN
NTU-RGB+D 120 dataset	9	Caetano et al. [6]	2019	67.9	62.8	Tree Structure + CNN
NTU-RGB+D 120 dataset	10	Caetano et al. [7]	2019	67.7	66.9	SkeleMotion
NTU-RGB+D 120 dataset	11	Liu et al. [69]	2018	64.6	66.9	Body Pose Evolution Map
NTU-RGB+D 120 dataset	12	Ke et al. [40]	2018	62.2	61.8	Multi-Task CNN with RotClips
NTU-RGB+D 120 dataset	13	Liu et al. [64]	2017	61.2	63.3	Two-Stream Attention LSTM
NTU-RGB+D 120 dataset	14	Liu et al. [71]	2017	60.3	63.2	Skeleton Visualization (Single Stream)
NTU-RGB+D 120 dataset	15	Jun et al. [67]	2019	59.9	62.4	Online+Dilated CNN
NTU-RGB+D 120 dataset	16	Ke et al. [39]	2017	58.4	57.9	Multi-Task Learning CNN
NTU-RGB+D 120 dataset	17	Jun et al. [65]	2017	58.3	59.2	Global Context-Aware Attention LSTM
NTU-RGB+D 120 dataset	18	Jun et al. [63]	2016	55.7	57.9	Spatio-Temporal LSTM

GCN-based methods account for most of the top-ranked entries on NTU-RGB+D and NTU-RGB+D 120, although the highest-ranked entry on both datasets is a two-stream Transformer (Wang et al. [109]).
Transformer-based methods show strong potential and are increasingly combined with GCNs or CNNs in hybrid models.
The newer NTU-RGB+D 120 dataset is more difficult than NTU-RGB+D, and accuracies on it are lower overall, indicating room for further advancement across architectures.
Representations that capture joint–bone structure and spatial-temporal graphs, along with adaptive topologies, contribute to performance gains.
Datasets and evaluation protocols (Cross-Subject, Cross-View, Cross-Setup) are crucial for fair comparisons of 3D SAR models.
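
To make the role of the evaluation protocols concrete, here is a minimal sketch of how Cross-Subject, Cross-View, and Cross-Setup splits could be derived from sample names. It assumes names follow the NTU-style pattern `SsssCcccPpppRrrrAaaa` (setup, camera, performer, replication, action). The helper names and the ID sets in the usage lines are hypothetical placeholders, not the official training lists, which should be taken from the dataset documentation.

```python
import re

# Assumption: sample names look like "S001C002P003R001A010", i.e.
# setup, camera, performer (subject), replication, action.
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")


def parse_name(name):
    """Extract the integer IDs encoded in a sample name."""
    m = NAME_RE.search(name)
    if m is None:
        raise ValueError(f"unrecognized sample name: {name!r}")
    setup, camera, subject, replication, action = map(int, m.groups())
    return {"setup": setup, "camera": camera, "subject": subject,
            "replication": replication, "action": action}


def split(names, key, train_ids):
    """Put a sample in the training set if its `key` ID is in train_ids."""
    train, test = [], []
    for name in names:
        target = train if parse_name(name)[key] in train_ids else test
        target.append(name)
    return train, test


names = [
    "S001C001P001R001A001",
    "S001C002P002R001A001",
    "S002C003P001R002A060",
    "S003C001P003R001A007",
]

# Placeholder ID sets chosen for this toy list only.
xsub_train, xsub_test = split(names, "subject", {1, 2})   # Cross-Subject
xview_train, xview_test = split(names, "camera", {2, 3})  # Cross-View
xset_train, xset_test = split(names, "setup", {2})        # Cross-Setup
```

Because each protocol holds out a different factor (performer, camera viewpoint, or capture setup), accuracies reported under different protocols are not directly comparable.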

This review was created by AI and reviewed by human editors.