[Paper Review] A Large-scale Varying-view RGB-D Action Dataset for Arbitrary-view Human Action Recognition
This paper introduces a large-scale, 360° varying-view RGB-D action dataset with 118 subjects performing 40 actions across 8 fixed viewpoints and full-circle sequences, enabling arbitrary-view human action recognition. It proposes a View-guided Skeleton CNN (VS-CNN) that groups views into four sectors, trains view-specific classifiers, and fuses predictions via weighted averaging, achieving state-of-the-art performance on cross-subject, cross-view, and arbitrary-view recognition benchmarks.
Current research on action recognition mainly focuses on single-view and multi-view recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications that must recognize actions from arbitrary views. The lack of datasets is a further barrier. To provide data for arbitrary-view action recognition, we collect a new large-scale RGB-D action dataset for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The dataset includes action samples captured from 8 fixed viewpoints and varying-view sequences that cover the entire 360-degree range of view angles. In total, 118 people were invited to perform 40 action categories, and 25,600 video samples were collected. Our dataset involves more participants, more viewpoints and a large number of samples. More importantly, it is the first dataset containing entire 360-degree varying-view sequences. The dataset provides sufficient data for multi-view, cross-view and arbitrary-view action analysis. In addition, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experimental results show that the VS-CNN achieves superior performance.
Motivation & Objective
- To address the lack of large-scale datasets supporting arbitrary-view human action recognition in real-world HRI applications.
- To collect a comprehensive RGB-D dataset with full 360° view coverage, including 8 fixed viewpoints and continuous varying-view sequences.
- To develop a deep learning model capable of recognizing actions across large view changes, especially when test views are unseen during training.
- To evaluate the proposed method under cross-subject, cross-view, and arbitrary-view recognition settings, simulating real-world robot interaction scenarios.
Proposed method
- The dataset is collected using 8 synchronized RGB-D cameras arranged in a circle, capturing 118 subjects performing 40 fitness-related actions.
- The dataset includes synchronized RGB videos, depth sequences, and skeleton sequences, totaling 25,600 video samples over 83 hours of footage.
- The proposed VS-CNN model divides the 360° view space into four overlapping view groups to handle large view variations.
- A view-group prediction module assigns each action sample to one of the four view groups, guiding the training of four view-specific classifiers.
- The model uses feature fusion from four view-specific classifiers with learned weights to produce a final prediction via SoftMax.
- The framework is trained and evaluated on multiple protocols: cross-subject, cross-view, and arbitrary-view recognition, with varying-sequence segmentation for robustness.
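The view-grouping and fusion steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the sector centres, the 120° sector width, and the function names (`view_groups`, `fuse`) are assumptions made for illustration, since the review does not state the group boundaries or how the fusion weights are obtained. Here the weights are simply passed in, and fusion is applied to per-group class scores.

```python
import math

# Hypothetical layout (not from the paper): four sectors centred at
# 0, 90, 180 and 270 degrees, each 120 degrees wide, so that
# neighbouring sectors overlap by 30 degrees.
GROUP_CENTERS = (0.0, 90.0, 180.0, 270.0)
GROUP_WIDTH = 120.0


def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def view_groups(angle):
    """Return the indices of every group whose sector contains `angle`."""
    return [i for i, center in enumerate(GROUP_CENTERS)
            if angular_distance(angle, center) <= GROUP_WIDTH / 2]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(group_weights, per_group_scores):
    """Weighted sum of per-group class scores, followed by softmax.

    group_weights: one weight per view group (assumed to be given).
    per_group_scores: one list of class scores per view group.
    """
    n_classes = len(per_group_scores[0])
    fused = [sum(w * scores[k]
                 for w, scores in zip(group_weights, per_group_scores))
             for k in range(n_classes)]
    return softmax(fused)
```

With this layout, a view at 45° falls in two groups at once (`view_groups(45.0)` returns `[0, 1]`), which is one way to read the "overlapping" design: samples near a sector boundary contribute to both neighbouring classifiers.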
Experimental results
Research questions
- RQ1: Can a deep learning model achieve robust action recognition when test views are not seen during training, using only limited-view training data?
- RQ2: How does full 360° view coverage in a dataset improve performance for arbitrary-view action recognition compared to limited-view benchmarks?
- RQ3: To what extent does view-grouping and view-guided feature learning enhance generalization across large view changes?
- RQ4: How does the performance of the proposed VS-CNN compare to existing methods under cross-subject, cross-view, and arbitrary-view recognition protocols?
Key findings
- The proposed VS-CNN achieves superior recognition accuracy on arbitrary-view action recognition tasks compared to eight baseline methods, including ResNeXt and JOULE.
- In arbitrary-view recognition II, where both training and testing data cover full-circle views, recognition accuracy curves are flat and consistently high, indicating strong generalization.
- Segmenting varying-view sequences into 10 sections yields better performance than 15 sections, as the resulting clips align better with standard action durations and improve model generalization.
- Cross-subject recognition achieves the highest accuracy, while cross-view and arbitrary-view recognition show lower but still strong performance, indicating the challenge of domain shift across views.
- The use of full 360° varying-view sequences for training significantly improves model robustness and performance compared to training only on fixed viewpoints.
- The overlapping view-group design in VS-CNN enables effective feature learning across view transitions, reducing sensitivity to viewpoint changes.
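The segmentation of a varying-view sequence into a fixed number of sections, mentioned in the findings above, can be sketched as follows. The review does not describe how the paper cuts the sequences, so the equal-length split and the function name `split_into_sections` are assumptions for illustration only.

```python
def split_into_sections(n_frames, n_sections):
    """Split a sequence of n_frames into n_sections contiguous ranges.

    Returns a list of (start, end) frame indices with `end` exclusive.
    Section lengths differ by at most one frame; any remainder is
    spread over the first sections.
    """
    if n_sections <= 0 or n_frames < n_sections:
        raise ValueError("need 0 < n_sections <= n_frames")
    base, extra = divmod(n_frames, n_sections)
    ranges = []
    start = 0
    for i in range(n_sections):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

If a sequence covers the full circle at a roughly constant rate (also an assumption), 10 sections correspond to about 36° of view change per clip and 15 sections to about 24°.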
This review was created by AI and reviewed by human editors.