QUICK REVIEW

[Paper Review] RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

Tao Jiang, Xinchen Xie|arXiv (Cornell University)|Jul 11, 2024

Human Pose and Action Recognition | Cited by 5

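One-Sentence Summary

RTMW is a family of real-time, open-source models for multi-person 2D and monocular 3D whole-body pose estimation. It builds on RTMPose and adds PAFPN and a Hierarchical Encoding Module (HEM) to improve fine-grained pose accuracy across body parts.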

ABSTRACT

Whole-body pose estimation is a challenging task that requires simultaneous prediction of keypoints for the body, hands, face, and feet. Whole-body pose estimation aims to predict fine-grained pose information for the human body, including the face, torso, hands, and feet, which plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We incorporate RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from different body parts with various scales. The model is trained with a rich collection of open-source human keypoint datasets with manually aligned annotations and further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release three sizes: m/l/x, with RTMW-l achieving a 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. Meanwhile, we explored the performance of RTMW in the task of 3D whole-body pose estimation, conducting image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models have been made publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose

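Motivation and Objectives

Address the challenge of real-time whole-body pose estimation covering the body, hands, face, and feet.
Build on and strengthen the existing RTMPose architecture, using multi-scale feature fusion to localize fine-grained parts more accurately.
Improve performance with a rich, manually aligned multi-dataset training scheme and two-stage distillation.
Extend the method to monocular 3D whole-body pose estimation through a coordinate classification strategy (SimCC) and unified training across datasets.
Provide open-source models in several sizes for industrial deployment and real-time inference.
Demonstrate competitive accuracy on COCO-Wholebody and H3WB while keeping inference efficient.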

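Proposed Method

Adds PAFPN (a feature pyramid network) and HEM (Hierarchical Encoding Module) to RTMPose to improve multi-scale feature resolution for small parts (face, hands, feet).
Uses SimCC-based coordinate classification for 2D keypoints, which avoids high-resolution heatmaps and reduces architectural complexity.
Applies two-stage distillation (as described in DWPose) and trains jointly on 14 manually aligned open-source datasets mapped to the 133-keypoint COCO-Wholebody format.
Extends RTMW to 3D as RTMW3D by adding a z-axis prediction branch and using a root-relative z-offset scheme so that datasets can be unified.
Trains RTMW/RTMW3D on combined 2D/3D datasets, using a z-axis mask to unify 2D and 3D training and to improve 3D pose estimation quality.
Releases open-source code and models (RTMW/RTMW3D) for real-time deployment and industrial use.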
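
To make the SimCC idea concrete, here is a minimal decoding sketch in plain Python. It is an illustration under stated assumptions, not the RTMW implementation: the function names, the default `split_ratio` of 2.0, and the use of the smaller of the two peak values as a confidence score are all choices made for this example. Each keypoint gets one 1D score vector per axis, and the coordinate is read off as the index of the highest-scoring bin divided by the split ratio (bins per pixel).

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_simcc(
    x_scores: Sequence[Sequence[float]],
    y_scores: Sequence[Sequence[float]],
    split_ratio: float = 2.0,  # assumed bins-per-pixel, for illustration only
) -> List[Tuple[float, float, float]]:
    """Decode per-keypoint 1D score vectors into (x, y, confidence).

    x_scores[k] and y_scores[k] are the classification scores of keypoint k
    over the horizontal and vertical bins respectively.
    """
    keypoints = []
    for xs, ys in zip(x_scores, y_scores):
        bin_x, bin_y = argmax(xs), argmax(ys)
        # Illustrative confidence: the weaker of the two axis peaks.
        confidence = min(xs[bin_x], ys[bin_y])
        # Convert bin indices back to pixel coordinates.
        keypoints.append((bin_x / split_ratio, bin_y / split_ratio, confidence))
    return keypoints


if __name__ == "__main__":
    # One keypoint, 4 horizontal bins and 6 vertical bins.
    print(decode_simcc([[0.1, 0.2, 0.9, 0.3]], [[0.0, 0.8, 0.1, 0.1, 0.05, 0.0]]))
```

Because each axis is classified independently, the output size grows with width plus height rather than width times height, which is the reason this formulation avoids high-resolution 2D heatmaps. A z axis could in principle be decoded the same way from a third score vector, though how RTMW3D bins depth is not specified in this summary.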

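Experimental Results

Research Questions

RQ1: Can RTMW achieve higher accuracy in whole-body pose estimation (body, face, hands, feet) while retaining real-time inference?
RQ2: How do PAFPN and HEM affect localization accuracy for low-resolution parts (hands, feet)?
RQ3: Do two-stage distillation and dataset alignment improve open-source whole-body pose performance relative to RTMPose?
RQ4: Can SimCC-based coordinate classification be applied effectively to monocular 3D whole-body pose estimation under a unified training scheme?
RQ5: How do RTMW/RTMW3D perform in practice on CPU (speed and latency) compared with earlier open-source methods?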

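Key Findings

RTMW-l reaches 70.2 mAP on COCO-Wholebody, making it the first open-source model to exceed 70 mAP on this benchmark.
RTMW3D shows strong performance on 3D whole-body pose estimation (results on COCO-Wholebody test examples and on the H3WB benchmark).
The PAFPN and HEM modules markedly improve localization of low-resolution parts (hands, feet) as well as overall whole-body AP/AR.
Two-stage distillation combined with joint training on 14 datasets (aligned to the 133-keypoint COCO-Wholebody format) improves accuracy over the RTMPose baseline.
RTMW/RTMW3D maintain competitive inference speed on CPU and are suitable for real-time deployment with ONNXRuntime.
In 3D, the SimCC-based approach combined with the root-relative z-offset scheme provides effective monocular 3D whole-body pose estimation.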
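
The root-relative z offset and the z-axis mask can be sketched as follows. This is a hedged illustration only: the helper names, the choice of keypoint 0 as the root, and the plain L1 loss are assumptions made for the example (the summary describes the z branch as coordinate classification, so the L1 term is just a simple stand-in). The point is that depth is expressed relative to a root keypoint, so datasets with different absolute depth conventions become comparable, and that samples without 3D labels contribute nothing to the z loss.

```python
from typing import List, Sequence


def to_root_relative_z(z: Sequence[float], root_index: int = 0) -> List[float]:
    """Express each keypoint depth as an offset from the root keypoint."""
    root = z[root_index]
    return [value - root for value in z]


def masked_z_l1(
    pred_z: Sequence[Sequence[float]],
    target_z: Sequence[Sequence[float]],
    has_z: Sequence[bool],
) -> float:
    """Mean absolute z error over samples that actually carry 3D labels.

    has_z[i] is False for 2D-only samples; their z targets are ignored.
    """
    total = 0.0
    count = 0
    for pred, target, valid in zip(pred_z, target_z, has_z):
        if not valid:
            continue  # 2D-only sample: masked out of the z loss
        total += sum(abs(p - t) for p, t in zip(pred, target))
        count += len(target)
    return total / count if count else 0.0


if __name__ == "__main__":
    print(to_root_relative_z([2.5, 3.0, 2.0]))
    # Second sample is 2D-only, so its placeholder z targets are ignored.
    print(masked_z_l1(
        [[0.0, 0.25, -0.5], [9.0, 9.0, 9.0]],
        [[0.0, 0.5, -0.5], [0.0, 0.0, 0.0]],
        [True, False],
    ))
```

With a mask like this, 2D and 3D samples can share one batch: the x and y terms are trained on every sample, while the z term only sees samples that have depth annotations.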

This review was generated by AI and checked by a human editor.