QUICK REVIEW

[Paper Review] RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

Tao Jiang, Xinchen Xie | arXiv (Cornell University) | 2024-07-11

Human Pose and Action Recognition | Citations: 5

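One-line summary

RTMW is an open-source family of real-time multi-person 2D and monocular 3D whole-body pose estimation models that builds on RTMPose and adds PAFPN and a Hierarchical Encoding Module (HEM) to improve fine-grained pose accuracy across body parts.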

ABSTRACT

Whole-body pose estimation is a challenging task that requires simultaneous prediction of keypoints for the body, hands, face, and feet. Whole-body pose estimation aims to predict fine-grained pose information for the human body, including the face, torso, hands, and feet, which plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We incorporate RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from different body parts with various scales. The model is trained with a rich collection of open-source human keypoint datasets with manually aligned annotations and further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release three sizes: m/l/x, with RTMW-l achieving a 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. Meanwhile, we explored the performance of RTMW in the task of 3D whole-body pose estimation, conducting image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models have been made publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose

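Motivation and goals

Address the challenges of real-time whole-body pose estimation (body, hands, face, and feet).
Build on and improve the existing RTMPose architecture, using multi-scale feature fusion to better localize fine-grained parts.
Use a multi-dataset training scheme with manually aligned annotations, plus two-stage distillation, to improve performance.
Extend the approach to monocular 3D whole-body pose estimation with a coordinate classification strategy (SimCC) and unified training across datasets.
Provide open-source models in multiple sizes for industrial deployment and real-time inference.
Demonstrate competitive accuracy on COCO-Wholebody and H3WB while maintaining inference efficiency.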
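
To make the idea of manually aligned annotations concrete, the sketch below scatters keypoints from a source dataset into the 133 slots of the COCO-Wholebody layout (17 body, 6 feet, 68 face, 42 hands) and gives unmapped slots a weight of zero so they can be ignored during training. This is an illustration only: the function `align_to_wholebody`, the toy three-point source dataset, and its index map are hypothetical and are not taken from the RTMW code.

```python
from typing import Dict, List, Tuple

# COCO-Wholebody layout: 17 body + 6 feet + 68 face + 42 hand keypoints.
NUM_WHOLEBODY_KPTS = 133

Point = Tuple[float, float]


def align_to_wholebody(
    src_kpts: List[Point],
    index_map: Dict[int, int],
) -> Tuple[List[Point], List[float]]:
    """Scatter source keypoints into the 133-slot layout.

    index_map maps a source keypoint index to a target index in the
    133-keypoint schema. Slots with no source keypoint keep weight 0.0.
    """
    kpts: List[Point] = [(0.0, 0.0)] * NUM_WHOLEBODY_KPTS
    weights: List[float] = [0.0] * NUM_WHOLEBODY_KPTS
    for src_idx, dst_idx in index_map.items():
        if not 0 <= dst_idx < NUM_WHOLEBODY_KPTS:
            raise ValueError(f"target index {dst_idx} is outside the schema")
        kpts[dst_idx] = src_kpts[src_idx]
        weights[dst_idx] = 1.0
    return kpts, weights


# Hypothetical source dataset with three points: nose, left wrist, right wrist.
# In the COCO body ordering these are indices 0, 9, and 10.
toy_map = {0: 0, 1: 9, 2: 10}
aligned, mask = align_to_wholebody([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], toy_map)
print(len(aligned), sum(mask))
```

The per-keypoint weight is the piece that lets datasets with different annotation coverage share one output head: a dataset that labels only hands, for example, would contribute loss only on the hand slots.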

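Proposed method

Introduces PAFPN (a feature pyramid) and HEM (Hierarchical Encoding Module) into RTMPose to improve multi-scale feature resolution for small parts (face, hands, feet).
Adopts SimCC-based coordinate classification for 2D keypoints, avoiding high-resolution heatmaps and reducing architectural complexity.
Applies two-stage distillation, as in DWPose, and trains jointly on 14 open-source datasets manually aligned to the COCO-Wholebody 133-keypoint schema.
Adds RTMW3D, which extends RTMW to 3D by introducing a z-axis prediction branch and a root-point-based z-offset scheme that unifies the datasets.
Trains RTMW/RTMW3D on combined 2D/3D datasets with a z-axis mask, enabling unified 2D-3D training and improving 3D pose estimation quality.
Releases open-source code and models (RTMW/RTMW3D) for real-time deployment and industrial use.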
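
The coordinate classification and z-axis mask described above can be sketched in a few lines of plain Python. In SimCC-style decoding, each axis is split into bins and a keypoint coordinate is read off as the index of the highest-scoring bin divided by a split ratio. The sketch assumes a split ratio of 2.0 and assumes that the z-axis mask simply zeroes the z loss for samples that carry only 2D labels; both are assumptions for illustration, and the names `decode_simcc` and `masked_axis_loss` are hypothetical rather than part of the RTMW release.

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decode_simcc(
    x_logits: List[Sequence[float]],
    y_logits: List[Sequence[float]],
    z_logits: List[Sequence[float]],
    split_ratio: float = 2.0,  # assumed value: bins per pixel along each axis
) -> List[Tuple[float, float, float]]:
    """Turn per-keypoint 1D bin scores into (x, y, z) coordinates."""
    coords = []
    for kx, ky, kz in zip(x_logits, y_logits, z_logits):
        coords.append(
            (
                argmax(kx) / split_ratio,
                argmax(ky) / split_ratio,
                argmax(kz) / split_ratio,
            )
        )
    return coords


def masked_axis_loss(per_axis_loss: Tuple[float, float, float], has_z: bool) -> float:
    """Sum per-axis losses, dropping the z term for 2D-only samples (assumed behavior)."""
    loss_x, loss_y, loss_z = per_axis_loss
    z_mask = 1.0 if has_z else 0.0
    return loss_x + loss_y + z_mask * loss_z


# One keypoint, four bins per axis.
x = [[0.1, 0.2, 0.9, 0.0]]
y = [[0.7, 0.1, 0.1, 0.1]]
z = [[0.0, 0.0, 0.0, 1.0]]
print(decode_simcc(x, y, z))
print(masked_axis_loss((1.0, 2.0, 3.0), has_z=False))
```

Because each axis is a separate 1D classification, adding a depth branch only requires one more vector of bin scores per keypoint, which is why the same head design can cover both the 2D and 3D variants.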

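Experimental results

Research questions

RQ1: Can RTMW achieve strong accuracy in whole-body pose estimation (body, face, hands, feet) while maintaining real-time inference?
RQ2: How do PAFPN and HEM affect localization accuracy for low-resolution parts such as hands and feet?
RQ3: Do two-stage distillation and dataset alignment improve open-source whole-body pose performance over RTMPose?
RQ4: Can the SimCC-based coordinate classification approach be applied effectively to monocular 3D whole-body pose estimation under a single unified training scheme?
RQ5: What is the practical CPU performance (speed/latency) of RTMW/RTMW3D, and how does it compare with earlier open-source methods?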

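Key results

RTMW-l achieves 70.2 mAP on COCO-Wholebody, making it the first open-source model to exceed 70 mAP on this benchmark.
RTMW3D shows strong performance in 3D whole-body pose estimation (COCO-Wholebody-style test results and the H3WB benchmark).
The PAFPN and HEM modules substantially improve localization of low-resolution parts (hands/feet) and overall whole-body AP/AR.
Two-stage distillation and joint training on 14 datasets aligned to the COCO-Wholebody 133-keypoint schema raise accuracy above the RTMPose baseline.
RTMW/RTMW3D maintain inference speeds suitable for real-time deployment on CPU with ONNXRuntime.
In 3D, the SimCC-based approach and the root-point z-offset framework provide effective monocular 3D whole-body pose estimation.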
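
The root-point z-offset idea can be illustrated as follows: each keypoint's depth is expressed relative to a root point, so that datasets recorded with different camera setups share a comparable z target. The review does not say which point serves as the root, so the sketch assumes the midpoint of the two hips (COCO body indices 11 and 12). That choice and the helper `to_root_relative_z` are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]

# COCO body indices for the hips. Using their midpoint as the root is an
# assumption made for this sketch, not a detail confirmed by the review.
LEFT_HIP, RIGHT_HIP = 11, 12


def to_root_relative_z(kpts: List[Point3D]) -> List[Point3D]:
    """Subtract the root depth from every keypoint, leaving x and y unchanged."""
    root_z = 0.5 * (kpts[LEFT_HIP][2] + kpts[RIGHT_HIP][2])
    return [(x, y, z - root_z) for x, y, z in kpts]


# Thirteen keypoints at depth 5.0, with the hips at 4.0 and 6.0.
pose = [(0.0, 0.0, 5.0)] * 13
pose[LEFT_HIP] = (0.0, 0.0, 4.0)
pose[RIGHT_HIP] = (0.0, 0.0, 6.0)
relative = to_root_relative_z(pose)
print(relative[0][2], relative[LEFT_HIP][2], relative[RIGHT_HIP][2])
```

After this transform the z values are offsets around zero rather than absolute camera distances, which suits a bounded, binned target such as the one SimCC-style classification needs.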

This review was generated by AI and reviewed by a human editor.