QUICK REVIEW

[Paper Review] SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Carlos Caetano, Jessica Sena | arXiv (Cornell University) | Jul 30, 2019

Human Pose and Action Recognition | 42 references | 212 citations

TL;DR

SkeleMotion encodes temporal dynamics of skeleton joints as motion magnitude and orientation across multiple temporal scales, used as input to a tiny CNN, achieving state-of-the-art results on NTU RGB+D 120 when fused with a spatial skeleton representation.

ABSTRACT

Due to the availability of large-scale skeleton datasets, 3D human action recognition has recently called the attention of computer vision community. Many works have focused on encoding skeleton data as skeleton image representations based on spatial structure of the skeleton joints, in which the temporal dynamics of the sequence is encoded as variations in columns and the spatial structure of each frame is represented as rows of a matrix. To further improve such representations, we introduce a novel skeleton image representation to be used as input of Convolutional Neural Networks (CNNs), named SkeleMotion. The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Different temporal scales are employed to compute motion values to aggregate more temporal dynamics to the representation making it able to capture long-range joint interactions involved in actions as well as filtering noisy motion values. Experimental results demonstrate the effectiveness of the proposed representation on 3D action recognition outperforming the state-of-the-art on NTU RGB+D 120 dataset.

Motivation & Objective

Improve skeleton-based 3D action recognition by explicitly modeling joint motion information.
Propose a novel skeleton image representation (SkeleMotion) that encodes magnitude and orientation of joint motions.
Leverage multi-scale temporal aggregation to capture long-range joint interactions and reduce noise.
Provide a lightweight CNN classifier that can train quickly on compact representations.
Demonstrate state-of-the-art or competitive results on NTU RGB+D 60/120 when using SkeleMotion, including fusion with spatial representations.

Proposed method

Construct a predefined joint chain C via depth-first skeleton traversal to preserve spatial relations.
Take the per-frame joint coordinates S and derive the motion structure D by differencing frames at temporal lag d (D_{c,t} = S_{c,t+d} - S_{c,t}, for joint c and frame t).
Derive magnitude M and orientation θ from D, with θ computed from xy, yz, zx components and filtered by a magnitude threshold m to suppress noise.
Normalize and resize the resulting M and θ representations to form SkeleMotion images (C x T x channels).
Apply a tiny CNN (3 conv layers, 2 FC layers) with training from scratch for action classification.
Introduce Temporal Scale Aggregation (TSA) by computing D, M, θ over multiple temporal lags d and stacking results to enrich temporal dynamics.
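
The motion steps above (frame difference, magnitude, thresholded orientation, and stacking over several lags) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `atan2` is used for the plane angles as a numerically safe choice (the paper's exact orientation formula may differ), and sequences from different lags are truncated to a common length before stacking, which is also an assumption. Pure Python is used to keep the sketch self-contained; an array library would be the practical choice.

```python
import math


def motion_structure(seq, d):
    """Frame difference with lag d: D[t][c] = S[t + d][c] - S[t][c].

    seq is a list of frames; each frame is a list of (x, y, z) joints,
    assumed to be already ordered along the joint chain C.
    """
    if d <= 0 or d >= len(seq):
        raise ValueError("lag d must satisfy 0 < d < number of frames")
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(seq[t], seq[t + d])]
        for t in range(len(seq) - d)
    ]


def _norm(v):
    """Euclidean length of a 3D displacement vector."""
    return math.sqrt(sum(x * x for x in v))


def magnitude(D):
    """Per-joint, per-frame motion magnitude M."""
    return [[_norm(v) for v in frame] for frame in D]


def orientation(D, m):
    """Angles in the xy, yz and zx planes, set to zero where magnitude < m.

    Using atan2 is an assumption of this sketch.
    """
    out = []
    for frame in D:
        row = []
        for dx, dy, dz in frame:
            if _norm((dx, dy, dz)) < m:
                row.append((0.0, 0.0, 0.0))  # suppress noisy, near-static joints
            else:
                row.append((math.atan2(dy, dx), math.atan2(dz, dy), math.atan2(dx, dz)))
        out.append(row)
    return out


def tsa_magnitude(seq, lags):
    """Stack magnitude maps from several lags as channels.

    Maps are truncated to the shortest one (an assumption of this sketch).
    The result is indexed [t][c][lag]; transpose it for a C x T layout.
    """
    maps = [magnitude(motion_structure(seq, d)) for d in lags]
    n_frames = min(len(mp) for mp in maps)
    n_joints = len(seq[0])
    return [
        [tuple(mp[t][c] for mp in maps) for c in range(n_joints)]
        for t in range(n_frames)
    ]


# Toy sequence: joint 0 moves one unit along x per frame, joint 1 is static.
toy = [[(float(t), 0.0, 0.0), (0.0, 1.0, 0.0)] for t in range(4)]
toy_theta = orientation(motion_structure(toy, 1), m=0.5)
toy_tsa = tsa_magnitude(toy, lags=(1, 2))
```

In the toy sequence the static joint falls below the threshold, so its orientation is zeroed, while the moving joint's magnitude grows with the lag. That is the kind of multi-scale signal TSA is meant to expose.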

Experimental results

Research questions

RQ1: Can explicit motion information (magnitude and orientation) across multiple temporal scales improve skeleton-based action recognition over existing skeleton image representations?
RQ2: Does multi-scale temporal aggregation help capture long-range joint interactions and reduce noisy motion signals?
RQ3: How does SkeleMotion perform relative to state-of-the-art skeleton-image based methods on NTU RGB+D 60 and 120 datasets, including when fused with spatial representations?

Key findings

SkeleMotion with Magnitude (TSA) achieves 69.6% cross-subject and 80.1% cross-view accuracy on NTU RGB+D 60, outperforming several baselines.
Using Orientation (TSA) alone yields competitive results but Magnitude (TSA) generally performs better; combining Magnitude+Orientation (TSA) improves accuracy further.
Fusion of SkeleMotion with Yang et al. (TSSI) methods further improves results, surpassing several baselines on NTU RGB+D 60 in both early and late fusion settings.
On NTU RGB+D 120, Magnitude+Orientation (TSA) results are competitive with state-of-the-art LSTM-based approaches and, when fused with Yang et al., achieve state-of-the-art performance, surpassing multiple prior skeleton-based methods.
The study shows that explicit motion modeling and TSA provide notable gains over motion-naive skeleton representations and baseline motion encodings.
The code for SkeleMotion is publicly available at https://github.com/carloscaetano/skeleton-images for reproducibility.
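
As a rough illustration of the late fusion mentioned in the findings above, the sketch below combines the class scores of two independently trained models by a weighted average and picks the top class. The review does not state the fusion rule actually used, so the weighted average, the function name `late_fusion`, and the `weight_a` parameter are all assumptions made for illustration.

```python
def late_fusion(scores_a, scores_b, weight_a=0.5):
    """Fuse two per-class score vectors by weighted averaging.

    Returns (predicted_class_index, fused_scores). The weighted-average
    rule is an assumption of this sketch, not a documented detail.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score vectors must cover the same classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(scores_a, scores_b)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Hypothetical scores from a motion-based and a spatial model for 3 classes.
motion_scores = [0.6, 0.3, 0.1]
spatial_scores = [0.2, 0.7, 0.1]
label, fused = late_fusion(motion_scores, spatial_scores)
```

Early fusion, by contrast, would combine the two representations before classification, which requires training a single network on the joint input.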

This review was created by AI and reviewed by human editors.