[Paper Review] OriNet: A Fully Convolutional Network for 3D Human Pose Estimation
OriNet presents a fully convolutional approach that predicts 3D human pose from a single image by modeling limb orientations bound to limb regions and jointly predicting 2D keypoints, achieving strong generalization and robustness to bounding box errors.
In this paper, we propose a fully convolutional network for 3D human pose estimation from monocular images. We use limb orientations as a new way to represent 3D poses and bind the orientation together with the bounding box of each limb region to better associate images and predictions. The 3D orientations are modeled jointly with 2D keypoint detections. Without additional constraints, this simple method can achieve good results on several large-scale benchmarks. Further experiments show that our method can generalize well to novel scenes and is robust to inaccurate bounding boxes.
Motivation & Objective
- Enable robust 3D pose estimation from a single RGB image without requiring tight cropping or a fixed subject scale.
- Propose a new orientation-based representation for limbs to decouple pose from bone length and improve generalization.
- Jointly model limb orientations with 2D keypoint detections within a fully convolutional framework.
- Demonstrate robustness to inaccurate bounding boxes and show competitive or state-of-the-art results on standard benchmarks.
Proposed method
- Represent each limb by a unit orientation vector derived from its two endpoint joints.
- Bind each limb orientation to an approximate limb region via a bounding box around the limb segment to preserve spatial association with the image.
- Use one orientation map per limb, in which the limb region is filled with the limb's orientation vector and the background is zero; train with L_o = sum_k ||O_k - Ō_k||^2, a squared-error loss between predicted and ground-truth maps summed over limbs k.
- Predict 2D keypoint heatmaps in parallel with the orientation maps; train them with a sigmoid cross-entropy loss L_p and combine the losses as L = L_o + λ L_p with λ = 0.2.
- Adopt a stacked-hourglass backbone (5-stack) to produce per-stack predictions; fuse image features, keypoint heatmaps, and orientation cues across stacks to refine predictions.
- Inference: extract 2D keypoints from heatmaps, crop limb regions on the orientation map, average orientations in each region, and recover 3D pose using limb orientations plus limb-length ratios and scale.
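The target construction and inference steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the axis-aligned padded box, the `pad` value, and the toy three-joint chain are all assumptions made for the example, and the network itself is replaced by feeding the ground-truth maps back in.

```python
import numpy as np

def limb_orientation(parent_3d, child_3d):
    """Unit vector pointing from the parent joint to the child joint."""
    v = np.asarray(child_3d, dtype=float) - np.asarray(parent_3d, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("degenerate limb: both endpoints coincide")
    return v / norm

def limb_box(p2d_a, p2d_b, pad, height, width):
    """Axis-aligned box around a 2D limb segment (assumed shape), padded and clipped."""
    x0 = int(max(min(p2d_a[0], p2d_b[0]) - pad, 0))
    y0 = int(max(min(p2d_a[1], p2d_b[1]) - pad, 0))
    x1 = int(min(max(p2d_a[0], p2d_b[0]) + pad + 1, width))
    y1 = int(min(max(p2d_a[1], p2d_b[1]) + pad + 1, height))
    return x0, y0, x1, y1

def orientation_map(orientation, box, height, width):
    """(3, H, W) map: the orientation vector inside the box, zeros elsewhere."""
    x0, y0, x1, y1 = box
    omap = np.zeros((3, height, width))
    omap[:, y0:y1, x0:x1] = orientation[:, None, None]
    return omap

def orientation_loss(pred_maps, target_maps):
    """L_o: squared differences summed over all limb maps."""
    return float(sum(np.sum((p - t) ** 2) for p, t in zip(pred_maps, target_maps)))

def read_orientation(pred_map, box):
    """Average the map inside the limb box and renormalize to unit length."""
    x0, y0, x1, y1 = box
    mean = pred_map[:, y0:y1, x0:x1].reshape(3, -1).mean(axis=1)
    return mean / np.linalg.norm(mean)

def recover_pose(root_3d, limbs, orientations, lengths):
    """Walk the kinematic chain: child = parent + length * orientation.
    `limbs` must be listed so that each parent is placed before its child."""
    joints = {limbs[0][0]: np.asarray(root_3d, dtype=float)}
    for (parent, child), o, length in zip(limbs, orientations, lengths):
        joints[child] = joints[parent] + length * o
    return joints

# Toy example with hypothetical joints (not data from the paper).
H, W = 40, 32
limbs = [("hip", "knee"), ("knee", "ankle")]
gt_3d = {"hip": np.array([0.0, 0.0, 0.0]),
         "knee": np.array([0.0, 4.0, 0.0]),
         "ankle": np.array([0.0, 4.0, 3.0])}
kp_2d = {"hip": (10, 5), "knee": (10, 20), "ankle": (12, 30)}

boxes = [limb_box(kp_2d[a], kp_2d[b], pad=2, height=H, width=W) for a, b in limbs]
targets = [orientation_map(limb_orientation(gt_3d[a], gt_3d[b]), box, H, W)
           for (a, b), box in zip(limbs, boxes)]

# Stand-in for network output: a perfect prediction equals the target maps.
loss = orientation_loss(targets, targets)
pred_orients = [read_orientation(m, box) for m, box in zip(targets, boxes)]
recovered = recover_pose(gt_3d["hip"], limbs, pred_orients, lengths=[4.0, 3.0])
```

Here `lengths` plays the role of the limb-length ratios multiplied by a global scale; with a real network, `targets` in the last three lines would be replaced by the predicted orientation maps and `kp_2d` by the keypoints extracted from the heatmaps.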
Experimental results
Research questions
- RQ1: Can limb orientations bound to limb regions provide a robust representation for 3D pose estimation from monocular images?
- RQ2: Does combining limb orientation with 2D keypoint detection in a fully convolutional pipeline improve generalization and robustness to bounding box errors?
- RQ3: How does orientation-based prediction compare to direct bone-length or joint-coordinate regression in FCN architectures?
- RQ4: What is the generalization performance of OriNet across datasets and novel scenes?
Key findings
- The orientation-based representation is scale-invariant and improves generalization across datasets and novel scenes.
- Coupling limb orientations with limb-region bounding boxes preserves spatial associations and improves pose reasoning.
- The method achieves competitive or state-of-the-art results on Human3.6M and MPI-INF-3DHP datasets, with robustness to bounding box jitter.
- The approach is robust to background clutter and depends less on tightly cropped subjects.
- In ablations, using orientations outperforms bone-length representations in both single-stack and multi-stack configurations.
- Inference runs at 20 fps on a Titan XP GPU, demonstrating practical efficiency.
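The scale-invariance claim in the first finding follows directly from how unit orientation vectors are defined, and a small check makes that concrete. The sketch below is only an illustration under assumed inputs: the three-joint pose, the limb index pairs, and the helper names are hypothetical and not taken from the paper.

```python
import numpy as np

def limb_orientations(joints, limbs):
    """Unit parent-to-child vectors for each (parent, child) index pair."""
    parents = joints[[p for p, _ in limbs]]
    children = joints[[c for _, c in limbs]]
    vecs = children - parents
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def bone_lengths(joints, limbs):
    """Euclidean length of each limb."""
    parents = joints[[p for p, _ in limbs]]
    children = joints[[c for _, c in limbs]]
    return np.linalg.norm(children - parents, axis=1)

# Hypothetical three-joint chain; the units are arbitrary.
pose = np.array([[0.00, 0.00, 0.00],
                 [0.10, -0.45, 0.05],
                 [0.15, -0.90, 0.00]])
limbs = [(0, 1), (1, 2)]

scaled_pose = 1.3 * pose  # same pose, larger subject

same_orientation = np.allclose(limb_orientations(pose, limbs),
                               limb_orientations(scaled_pose, limbs))
length_ratio = bone_lengths(scaled_pose, limbs) / bone_lengths(pose, limbs)
```

Uniformly scaling the skeleton leaves every orientation vector unchanged while every bone length grows by the same factor, which is why subject size can be handled separately through limb-length ratios and a global scale.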
This review was created by AI and reviewed by human editors.