QUICK REVIEW

[Paper Review] Hand3D: Hand Pose Estimation using 3D Neural Network

Xiaoming Deng, Shuo Yang|arXiv (Cornell University)|Apr 7, 2017

Hand Gesture Recognition Systems | 20 references | 67 citations

TL;DR

This paper proposes a 3D CNN that directly estimates 3D hand joint positions from a TSDF volumetric representation of a depth image, with synthetic data augmentation and a TSDF refinement module, achieving state-of-the-art results on NYU and ICVL hand pose datasets.

ABSTRACT

We propose a novel 3D neural network architecture for 3D hand pose estimation from a single depth image. Different from previous works that mostly run on the 2D depth image domain and require intermediate or post-processing steps to bring in supervision from 3D space, we convert the depth map to a 3D volumetric representation and feed it into a 3D convolutional neural network (CNN) to directly produce the pose in 3D, requiring no further processing. Our system does not require a ground-truth reference point for initialization, and our network architecture naturally integrates both local features and global context in 3D space. To increase the coverage of the hand pose space in the training data, we render synthetic depth images by transferring hand poses from existing real image datasets. We evaluate our algorithm on two public benchmarks and achieve state-of-the-art performance. The synthetic hand pose dataset will be available.

Motivation & Objective

Estimate 3D hand pose directly from a single depth image, without post-processing or predefined hand models.
Propose a 3D volumetric representation (TSDF) and a 3D CNN to predict 3D joint locations in COM coordinates.
Improve training data diversity and depth quality via TSDF refinement and synthetic data augmentation with variable bone lengths.
Demonstrate state-of-the-art performance on NYU and ICVL hand pose benchmarks.
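
The "COM coordinates" mentioned above can be illustrated with a minimal sketch. It assumes COM means the center of mass of the hand's depth pixels, a pinhole camera with hypothetical intrinsics fx, fy, cx, cy, and a depth map already cropped to the hand (0 marks missing pixels). The function names are illustrative and are not taken from the paper.

```python
def center_of_mass(depth, fx, fy, cx, cy):
    """Mean back-projected 3D point of all valid (depth > 0) pixels.

    depth is a list of rows; pixel (u, v) is depth[v][u].
    Assumption: every valid pixel belongs to the hand.
    """
    sx = sy = sz = 0.0
    n = 0
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                # Pinhole back-projection of pixel (u, v) at depth z.
                sx += (u - cx) * z / fx
                sy += (v - cy) * z / fy
                sz += z
                n += 1
    if n == 0:
        raise ValueError("no valid depth pixels")
    return (sx / n, sy / n, sz / n)


def to_com_frame(joints, com):
    """Express camera-space joints as offsets from the COM."""
    return [tuple(j[i] - com[i] for i in range(3)) for j in joints]


def from_com_frame(offsets, com):
    """Map COM-relative offsets back to camera space."""
    return [tuple(o[i] + com[i] for i in range(3)) for o in offsets]
```

Under these assumptions, a network trained on the output of to_com_frame predicts offsets, and from_com_frame recovers camera-space joints with a single addition, so no 2D-to-3D lifting step is involved.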

Proposed method

Convert depth maps to 60x60x60 TSDF volumes aligned at the hand COM.
Refine raw TSDF with a 3D FCN that completes missing depth and reduces artifacts.
Use a 3D ConvNet to directly regress 3D joint locations relative to the COM with an L2 loss.
Train the network end-to-end on augmented data including synthetic poses with variable bone lengths.
Perform data augmentation by transferring hand poses to configurable CAD models and rendering depth images.
Optionally recover poses from real data via inverse kinematics and transfer to BVH for synthetic data generation.
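
As a rough illustration of the first step above, the following sketch builds a TSDF volume centered at the hand COM. It uses a simplified projective TSDF (signed distance measured along the camera z axis), which may differ from the exact TSDF variant in the paper. The cube side length, truncation distance, and camera intrinsics are assumed values chosen for illustration; only the 60x60x60 resolution comes from the review.

```python
def depth_to_tsdf(depth, fx, fy, cx, cy, com, side=0.3, res=60, trunc=0.03):
    """Projective TSDF of a depth map inside a cube centered at `com`.

    depth: list of rows, in metres, 0 = missing; pixel (u, v) is depth[v][u].
    com:   (x, y, z) cube center in camera coordinates, in metres.
    Returns a res^3 nested list indexed [ix][iy][iz] with values in [-1, 1]:
    positive in front of the observed surface, negative behind it.
    Voxels with no valid depth sample keep the default value 1.0.
    """
    h, w = len(depth), len(depth[0])
    voxel = side / res
    # Center of the first voxel along each axis.
    origin = [c - side / 2 + voxel / 2 for c in com]
    vol = [[[1.0] * res for _ in range(res)] for _ in range(res)]
    for ix in range(res):
        x = origin[0] + ix * voxel
        for iy in range(res):
            y = origin[1] + iy * voxel
            for iz in range(res):
                z = origin[2] + iz * voxel
                if z <= 0:
                    continue
                # Project the voxel center into the depth image.
                u = int(round(fx * x / z + cx))
                v = int(round(fy * y / z + cy))
                if not (0 <= u < w and 0 <= v < h):
                    continue
                d = depth[v][u]
                if d <= 0:
                    continue
                # Signed distance along z, scaled and truncated to [-1, 1].
                vol[ix][iy][iz] = max(-1.0, min(1.0, (d - z) / trunc))
    return vol
```

A pure-Python triple loop keeps the sketch dependency-free; a practical version would vectorize it. The refinement FCN and the regression network described in the list would then consume this volume.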

Experimental results

Research questions

RQ1: Can a 3D CNN operating on TSDF volumes estimate 3D hand joint positions directly in COM coordinates without post-processing?
RQ2: Do TSDF refinement and 3D data augmentation improve 3D hand pose accuracy on standard benchmarks?
RQ3: How well does the method generalize to different hand skeletons and bone lengths?
RQ4: What is the performance impact of the proposed synthetic data augmentation and bone-length variation on pose estimation?

Key findings

The method achieves state-of-the-art performance on NYU and ICVL hand pose datasets.
Direct 3D pose estimation in COM coordinates eliminates the need for post-processing to project 2D estimates into 3D.
TSDF refinement improves pose accuracy, especially at lower error thresholds.
Data augmentation with varying bone lengths and synthetic pose transfer significantly boosts performance.
The approach runs at about 30 FPS on a GTX TITAN X, faster than several model-based methods while delivering higher accuracy.
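
The reference to "error thresholds" above suggests a threshold-based accuracy metric. The review does not define it, so the sketch below assumes a metric commonly reported on hand pose benchmarks: the fraction of frames whose worst joint error falls within a given threshold. Function names and the millimetre unit are illustrative assumptions.

```python
import math


def joint_errors(pred, gt):
    """Euclidean distance per joint between predicted and true 3D positions."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def fraction_within(preds, gts, threshold_mm):
    """Fraction of frames whose maximum joint error is <= threshold_mm.

    preds, gts: one list of (x, y, z) joints per frame, in millimetres.
    """
    ok = sum(
        1 for p, g in zip(preds, gts)
        if max(joint_errors(p, g)) <= threshold_mm
    )
    return ok / len(preds)
```

Sweeping threshold_mm over a range gives the curve on which methods are typically compared. Under this assumed metric, an improvement at low thresholds means more frames in which every joint is localized precisely.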

This review was created by AI and reviewed by human editors.