[Paper Review] HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
HigherHRNet introduces a scale-aware high-resolution feature pyramid with multi-resolution supervision and heatmap aggregation to improve bottom-up multi-person pose estimation, achieving state-of-the-art results on COCO test-dev and strong performance on CrowdPose.
Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small person. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium person on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scene. The code and models are available at https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation.
Motivation & Objective
- Address scale variation in bottom-up multi-person pose estimation, especially for small persons.
- Develop a high-resolution feature pyramid that preserves spatial detail across scales.
- Train with multi-resolution supervision and perform multi-resolution heatmap aggregation during inference.
- Demonstrate improved keypoint localization accuracy on COCO and robustness in crowded scenes (CrowdPose).
Proposed method
- Build on HRNet to create a high-resolution feature pyramid that starts at 1/4 of the input resolution, then upsample with a transposed convolution (deconvolution) to generate higher-resolution heatmaps.
- Apply multi-resolution supervision by transforming ground-truth keypoints to resolutions across the pyramid and using Gaussian heatmaps at each resolution.
- Predict heatmaps at multiple resolutions and aggregate them during inference to form scale-aware heatmaps.
- Use associative embedding for keypoint grouping to form person instances.
- Optionally add residual blocks in the deconvolution module to refine features and heatmaps.
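The multi-resolution supervision and heatmap aggregation steps above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the strides (4, 2), the shared sigma of 2.0, and the choice to average at the largest heatmap resolution (rather than at the input image resolution) are all assumptions made for illustration.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Return a height x width grid with an unnormalized Gaussian peak at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def multi_resolution_targets(keypoint_xy, image_size, strides=(4, 2), sigma=2.0):
    """Build one target heatmap per pyramid level.

    The keypoint is rescaled to each level's resolution. Using the same
    sigma at every level is an assumption of this sketch.
    """
    img_w, img_h = image_size
    targets = []
    for stride in strides:
        w, h = img_w // stride, img_h // stride
        cx, cy = keypoint_xy[0] / stride, keypoint_xy[1] / stride
        targets.append(gaussian_heatmap(w, h, cx, cy, sigma))
    return targets


def upsample_bilinear(heatmap, out_w, out_h):
    """Resize a 2D grid with bilinear interpolation (pixel-centre alignment)."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    out = []
    for oy in range(out_h):
        # Map the output pixel centre back into input coordinates, clamped to the grid.
        sy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = heatmap[y0][x0] * (1 - fx) + heatmap[y0][x1] * fx
            bottom = heatmap[y1][x0] * (1 - fx) + heatmap[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def aggregate_heatmaps(heatmaps):
    """Upsample every heatmap to the largest resolution present and average them."""
    out_h = max(len(h) for h in heatmaps)
    out_w = max(len(h[0]) for h in heatmaps)
    resized = [upsample_bilinear(h, out_w, out_h) for h in heatmaps]
    n = len(resized)
    return [
        [sum(r[y][x] for r in resized) / n for x in range(out_w)]
        for y in range(out_h)
    ]


# Usage: one keypoint at pixel (40, 24) in a hypothetical 64 x 64 image.
targets = multi_resolution_targets((40, 24), image_size=(64, 64))
fused = aggregate_heatmaps(targets)  # stand-in for predicted heatmaps
_, peak_x, peak_y = max(
    (value, x, y) for y, row in enumerate(fused) for x, value in enumerate(row)
)
print((peak_x, peak_y), len(fused[0]), len(fused))
```

In this sketch the ground-truth targets stand in for network predictions, so the fused map simply shows how a coarse and a fine heatmap are brought to a common resolution and averaged. In a real pipeline the inputs to `aggregate_heatmaps` would be the per-level outputs of the network, and grouping (for example with associative embedding) would run on the aggregated result.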
Experimental results
Research questions
- RQ1: Can a scale-aware, high-resolution feature pyramid improve keypoint localization for small persons in bottom-up pose estimation?
- RQ2: Do multi-resolution supervision and heatmap aggregation yield performance gains without post-processing refinements?
- RQ3: How does HigherHRNet perform on COCO and CrowdPose compared to existing bottom-up and top-down methods?
Key findings
- HigherHRNet reaches 66.4 AP on COCO2017 test-dev, improving on the HRNet baseline, and 70.5 AP with multi-scale testing, surpassing prior bottom-up methods.
- For medium-sized persons, HigherHRNet outperforms the previous best bottom-up method by 2.5 AP (the APM metric), a larger gain than for large persons, indicating better handling of scale variation.
- On COCO2017 test-dev, HigherHRNet-W48 with multi-scale test reaches 70.5 AP, outperforming all existing bottom-up methods without refinement.
- On CrowdPose test, HigherHRNet-W48 achieves 67.6 AP, surpassing top-down and prior bottom-up methods, showing robustness in crowded scenes.
- Ablation studies show that deconvolution, feature concatenation, heatmap aggregation, and increased backbone capacity all contribute to AP gains, with a single deconvolution module generally yielding the best COCO performance.
This review was created by AI and reviewed by human editors.