QUICK REVIEW

[Paper Review] Deep High-Resolution Representation Learning for Human Pose Estimation

Ke Sun, Bin Xiao | arXiv (Cornell University) | 2019-02-25

Human Pose and Action Recognition | References: 72 | Citations: 57

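One-Line Summary

This paper introduces HRNet, which maintains high-resolution representations throughout the network and repeatedly fuses multi-scale features, achieving state-of-the-art pose estimation results on the COCO, MPII, and PoseTrack datasets.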

ABSTRACT

This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.

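Motivation and Objectives

Learn reliable, spatially precise high-resolution representations for human pose estimation.
Design a network that keeps a high-resolution representation at every stage instead of recovering resolution from low-resolution features.
Enrich the high-resolution representation through repeated multi-scale fusion across parallel high-to-low resolution subnetworks.
Demonstrate superior keypoint heatmap accuracy on COCO and MPII, and improved pose tracking on PoseTrack.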
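
The repeated multi-scale fusion named in the objectives above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' implementation: each branch is a single-channel 2D list, 2x2 average pooling stands in for a learned strided convolution, nearest-neighbor repetition stands in for learned upsampling, and the function names (`downsample2x`, `upsample2x`, `resize`, `fuse`) are hypothetical. What it keeps is the core idea that every branch receives the sum of all branches resized to its own resolution.

```python
# Simplified sketch of one multi-scale fusion step (single channel, no learned weights).
# Assumption: average pooling and nearest-neighbor repetition replace the learned
# downsampling and upsampling layers a real network would use.

def downsample2x(fmap):
    """Halve height and width by averaging each 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1]
             + fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w // 2)
        ]
        for i in range(h // 2)
    ]


def upsample2x(fmap):
    """Double height and width by nearest-neighbor repetition."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def resize(fmap, src_level, dst_level):
    """Move a map between branches; level k has 1/2**k of the top resolution."""
    while src_level < dst_level:
        fmap = downsample2x(fmap)
        src_level += 1
    while src_level > dst_level:
        fmap = upsample2x(fmap)
        src_level -= 1
    return fmap


def fuse(branches):
    """Return one fused map per branch: the sum of all inputs at that branch's resolution."""
    fused = []
    for dst in range(len(branches)):
        resized = [resize(b, src, dst) for src, b in enumerate(branches)]
        h, w = len(resized[0]), len(resized[0][0])
        fused.append(
            [[sum(r[i][j] for r in resized) for j in range(w)] for i in range(h)]
        )
    return fused


# Two parallel branches: a 4x4 high-resolution map and a 2x2 low-resolution map.
high = [[1.0] * 4 for _ in range(4)]
low = [[2.0] * 2 for _ in range(2)]
fused_high, fused_low = fuse([high, low])
```

Both outputs keep their input resolution, so the fusion step can be applied again after further per-branch processing, which is what makes the exchange repeatable.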

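Proposed Method

Proposes the High-Resolution Net (HRNet), which starts from a high-resolution subnetwork and gradually adds parallel high-to-low resolution subnetworks.
Connects the multi-resolution subnetworks in parallel and performs repeated multi-scale fusion through exchange units placed across and within stages.
Regresses K heatmaps from the final high-resolution representation, trained with a mean squared error loss against ground-truth heatmaps generated with a Gaussian centered on each keypoint.
Instantiates HRNet at a small width (W32) and a large width (W48), each with four stages and eight exchange units.
Trains with standard data augmentation and the Adam optimizer; initializing the backbone from ImageNet pretraining further improves accuracy.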
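
The heatmap regression target described in the list above can be sketched as follows. This is a minimal illustration, not the official training code: the helper names (`gaussian_heatmap`, `mse_loss`), the 8x6 heatmap size, and the default `sigma=1.0` are assumptions chosen for readability.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Ground-truth heatmap: an unnormalized 2D Gaussian that peaks at 1.0 on (cx, cy).

    sigma is an illustrative default, not a value taken from this review.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def mse_loss(pred_maps, target_maps):
    """Mean squared error averaged over all K heatmaps and all pixels."""
    total, count = 0.0, 0
    for pred, target in zip(pred_maps, target_maps):
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
                count += 1
    return total / count


# K = 2 keypoints on hypothetical 8x6 heatmaps, keypoints given as (x, y).
keypoints = [(3, 4), (1, 2)]
targets = [gaussian_heatmap(8, 6, x, y) for x, y in keypoints]
blank_prediction = [[[0.0] * 6 for _ in range(8)] for _ in keypoints]

perfect_loss = mse_loss(targets, targets)          # prediction equals target
blank_loss = mse_loss(blank_prediction, targets)   # all-zero prediction
```

One heatmap per keypoint keeps the loss a plain per-pixel regression, so the spatial precision of the final high-resolution representation translates directly into how sharply each peak can be localized.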

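Experimental Results

Research Questions

RQ1: Does maintaining high-resolution representations throughout the network improve keypoint localization accuracy compared with conventional high-to-low resolution pipelines?
RQ2: Does repeated multi-scale fusion across parallel subnetworks yield richer high-resolution features and better heatmaps?
RQ3: How does HRNet's performance on the COCO, MPII, and PoseTrack benchmarks compare with state-of-the-art methods?
RQ4: How do network width and input resolution affect pose estimation accuracy and efficiency?
RQ5: Is HRNet effective beyond single-image pose estimation, for video-based pose tracking?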

Key Results

Method	Backbone	Input size	#Params	GFLOPs	AP	AP50	AP75	APM	APL	AR
HRNet-W32 (no pretraining)	HRNet-W32	256x192	28.5 M	7.10	73.4	89.5	80.7	70.2	80.1	78.9
HRNet-W32 (pretrained)	HRNet-W32	256x192	28.5 M	7.10	74.4	90.5	81.9	70.8	81.0	79.8
HRNet-W48 (pretrained)	HRNet-W48	256x192	63.6 M	14.6	75.1	90.6	82.2	71.5	81.8	80.4
SimpleBaseline	ResNet-152	256x192	68.6 M	15.7	72.0	89.3	79.8	68.7	78.9	77.8
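
The accuracy and cost trade-off in the table can be made explicit with a little arithmetic. The sketch below only restates figures from the table (the HRNet-W32 row with 74.4 AP and the SimpleBaseline row); the dictionary keys and variable names are illustrative.

```python
# Figures copied from the table above (256x192 input).
rows = {
    "HRNet-W32": {"params_m": 28.5, "gflops": 7.10, "ap": 74.4},
    "SimpleBaseline ResNet-152": {"params_m": 68.6, "gflops": 15.7, "ap": 72.0},
}

hrnet = rows["HRNet-W32"]
baseline = rows["SimpleBaseline ResNet-152"]

param_ratio = hrnet["params_m"] / baseline["params_m"]   # share of parameters
gflops_ratio = hrnet["gflops"] / baseline["gflops"]      # share of compute
ap_gain = hrnet["ap"] - baseline["ap"]                   # AP difference in points

print(f"params: {param_ratio:.0%}, GFLOPs: {gflops_ratio:.0%}, AP gain: {ap_gain:+.1f}")
```

On these numbers, HRNet-W32 uses roughly 42% of the parameters and 45% of the GFLOPs of SimpleBaseline with ResNet-152 while scoring 2.4 AP higher.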

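Without pretraining, HRNet-W32 reaches 73.4 AP on COCO val at 256x192 input, outperforming a similarly sized Hourglass at lower GFLOPs.
With pretraining, HRNet-W32 reaches 74.4 AP, 90.5 AP50, 81.9 AP75, and 79.8 AR on COCO val, improving on the version trained without pretraining.
With pretraining, HRNet-W48 reaches 75.1 AP, 90.6 AP50, 82.2 AP75, and 80.4 AR on COCO val, showing that a wider network improves accuracy.
On COCO test-dev, HRNet-W32 and HRNet-W48 reach 74.9 AP and 75.5 AP, respectively (single model, top-down approach).
On MPII, HRNet-W32 reaches 92.3 PCKh@0.5, outperforming many prior methods and matching or exceeding the state of the art.
On PoseTrack 2017, HRNet-W48 reaches 74.9 mAP and 57.9 MOTA, outperforming many baselines and showing strong video pose tracking performance.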
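
For readers unfamiliar with the MPII metric quoted above, the following sketch shows how a PCKh@0.5 score can be computed for one person. It is a simplified illustration: the function name `pckh` and the sample coordinates are hypothetical, the head size is assumed to be given, and details of the official evaluation such as skipping unannotated joints are omitted.

```python
import math


def pckh(pred, gt, head_size, threshold=0.5):
    """Fraction of keypoints whose error is within threshold * head_size.

    pred and gt are equal-length lists of (x, y) pixel coordinates.
    """
    limit = threshold * head_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return correct / len(gt)


# Hypothetical example: four keypoints, head size of 10 pixels, so the limit is 5 pixels.
gt = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred = [(10, 12), (20, 20), (36, 30), (40, 44)]
score = pckh(pred, gt, head_size=10)  # errors are 2, 0, 6, 4 pixels
```

Normalizing by head size makes the threshold scale with the person's apparent size, so a fixed pixel error counts for more on a small figure than on a large one.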


This review was generated by AI and reviewed by a human editor.