QUICK REVIEW

[Paper Review] RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose

Tao Jiang, Peng Lu|arXiv (Cornell University)|Mar 13, 2023
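Human Pose and Action Recognition | Cited by 107
One-Sentence Summary

RTMPose is a real-time, top-down multi-person pose estimation framework built on MMPose. It combines SimCC-based coordinate classification, a CSPNeXt backbone, and deployment-friendly optimizations to achieve high accuracy at low latency on CPU, GPU, and mobile devices.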

ABSTRACT

Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet its application in the industrial community still suffers from heavy model parameters and high latency. In order to bridge this gap, we empirically explore key factors in pose estimation including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-time multi-person pose estimation framework, RTMPose, based on MMPose. Our RTMPose-m achieves 75.8% AP on COCO with 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU, and RTMPose-l achieves 67.0% AP on COCO-WholeBody with 130+ FPS. To further evaluate RTMPose's capability in critical real-time applications, we also report the performance after deploying on the mobile device. Our RTMPose-s achieves 72.2% AP on COCO with 70+ FPS on a Snapdragon 865 chip, outperforming existing open-source libraries. Code and models are released at https://github.com/open-mmlab/mmpose/tree/1.x/projects/rtmpose.

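Motivation and Goals

  • Investigate the factors that determine the performance of real-time 2D multi-person pose estimation: paradigm, backbone network, localization method, training strategy, and deployment.
  • Develop a real-time pose estimation framework that balances speed and accuracy for industrial deployment.
  • Demonstrate portability across CPU, GPU, and mobile devices with different inference backends and detectors.
  • Release open-source models and deployment guides to encourage industry adoption.

Proposed Method

  • Adopt a top-down pipeline: an efficient detector finds each person, and a lightweight pose estimator runs on each detected person.
  • Use a CSPNeXt backbone for a good speed-accuracy balance and deployment friendliness.
  • Predict keypoints with SimCC-based coordinate classification (x and y are classified separately), using Gaussian soft labels and temperature scaling.
  • Add a self-attention refinement module (Gated Attention Unit, GAU) to improve the keypoint representations.
  • Apply a training strategy that includes UDP pre-training, EMA, a flat cosine learning-rate schedule, and two-stage augmentation (strong first, then weak).
  • Optimize the inference pipeline with skip-frame detection, OKS-based pose NMS, and OneEuro smoothing; deploy on PyTorch, ONNX Runtime, TensorRT, and ncnn.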
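
The coordinate-classification idea in the list above can be illustrated with a minimal NumPy sketch. It is not the RTMPose implementation: the function names, the bin split ratio of 2.0, the label sigma of 6.0, and the temperature values are all assumptions chosen for illustration. Each axis is discretized into bins, the ground-truth coordinate becomes a Gaussian soft label over those bins, and decoding takes the argmax bin on each axis.

```python
import numpy as np


def simcc_encode(coord, length, split_ratio=2.0, sigma=6.0):
    """Gaussian soft label for one coordinate (in pixels) along one axis.

    split_ratio and sigma are illustrative assumptions, not values
    taken from the paper.
    """
    bins = np.arange(int(length * split_ratio), dtype=np.float64)
    center = coord * split_ratio  # target position measured in bins
    label = np.exp(-((bins - center) ** 2) / (2.0 * sigma ** 2))
    return label / label.sum()  # normalize to a probability distribution


def softmax(logits, tau=1.0):
    """Softmax with temperature tau; a larger tau gives a flatter output."""
    z = logits / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def simcc_decode(x_logits, y_logits, split_ratio=2.0):
    """Recover (x, y) in pixels from the two per-axis classification outputs."""
    x = np.argmax(x_logits) / split_ratio
    y = np.argmax(y_logits) / split_ratio
    return float(x), float(y)


# Round trip on a 256x192 (height x width) input, the pose input size
# used in the comparison table of this review.
W, H = 192, 256
x_label = simcc_encode(100.5, W)
y_label = simcc_encode(37.0, H)

# Log-labels stand in for network logits in this sketch.
x_logits = np.log(x_label + 1e-12)
y_logits = np.log(y_label + 1e-12)
x_hat, y_hat = simcc_decode(x_logits, y_logits)

# Temperature scaling: tau=2 produces a flatter distribution than tau=1.
sharp = softmax(x_logits, tau=1.0)
flat = softmax(x_logits, tau=2.0)
```

Because each axis has more bins than pixels, the decoded coordinate has sub-pixel resolution (half a pixel with the assumed split ratio) without requiring a high-resolution 2D heatmap.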
Figure 1: Comparison of RTMPose and open-source libraries on COCO val set regarding model size, latency, and precision. The circle size represents the relative size of model parameters.

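Experimental Results

Research Questions

  • RQ1: Which paradigm, backbone, and localization method give the best speed-accuracy trade-off for real-time multi-person pose estimation?
  • RQ2: Can SimCC-based coordinate classification, combined with targeted training and architecture choices, match or exceed the accuracy of heatmap-based methods while reducing computation?
  • RQ3: How do deployment optimizations and platform-specific backends affect real-time performance on CPU, GPU, and mobile hardware?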

Key Findings

| Method | Backbone | Detector | Det. Input Size | Pose Input Size | GFLOPs | AP | Extra Data |
|---|---|---|---|---|---|---|---|
| PaddleDetection TinyPose | Wider NLiteHRNet | YOLOv3 | 608x608 | 128x96 | 0.08 | 52.3 | |
| PaddleDetection TinyPose | Wider NLiteHRNet | YOLOv3 | 608x608 | 256x192 | 0.33 | 60.9 | |
| PaddleDetection TinyPose | Wider NLiteHRNet | Faster-RCNN | N/A | 128x96 | 0.08 | 56.1 | AIC(220K) |
| PaddleDetection TinyPose | Wider NLiteHRNet | Faster-RCNN | N/A | 256x192 | 0.33 | 65.6 | +Internal(unknown) |
| PaddleDetection TinyPose | Wider NLiteHRNet | PicoDet-s | 320x320 | 128x96 | 0.08 | 48.4 | |
| PaddleDetection TinyPose | Wider NLiteHRNet | PicoDet-s | 320x320 | 256x192 | 0.33 | 56.5 | |
| AlphaPose | FastPose | YOLOv3 | 608x608 | 256x192 | 5.91 | 71.2 | - |
| MMPose | RTMPose-t | Faster-RCNN | N/A | 256x192 | 0.36 | 65.8 | - |
| MMPose | RTMPose-s | Faster-RCNN | N/A | 256x192 | 0.68 | 69.6 | - |
| MMPose | RTMPose-m | Faster-RCNN | N/A | 256x192 | 1.93 | 73.6 | - |
| MMPose | RTMPose-l | Faster-RCNN | N/A | 256x192 | 4.16 | 74.8 | - |
| MMPose | RTMPose-t | YOLOv3 | 608x608 | 256x192 | 0.36 | 66.0 | AIC(220K) |
| MMPose | RTMPose-s | YOLOv3 | 608x608 | 256x192 | 0.68 | 70.3 | |
| MMPose | RTMPose-m | YOLOv3 | 608x608 | 256x192 | 1.93 | 74.7 | |
| MMPose | RTMPose-l | YOLOv3 | 608x608 | 256x192 | 4.16 | 75.7 | |
| MMPose | RTMPose-s | RTMDet-nano | 320x320 | 256x192 | 0.68 | 68.5 | |
| MMPose | RTMPose-m | RTMDet-nano | 320x320 | 256x192 | 1.93 | 73.2 | |
| MMPose | RTMPose-m | RTMDet-m | 640x640 | 256x192 | 1.93 | 75.7 | |
| MMPose | RTMPose-l | RTMDet-m | 640x640 | 256x192 | 4.16 | 76.6 | |

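  • RTMPose-m reaches 75.8% AP on the COCO val set, running at 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU.
  • RTMPose-l reaches 74.8 AP on COCO with a Faster-RCNN detector and 76.6 AP with an RTMDet-m detector, at 4.16 GFLOPs, showing strong accuracy at a moderate compute budget.
  • RTMPose-s reaches 72.2% AP on COCO at 70+ FPS on a Snapdragon 865, outperforming existing open-source mobile solutions.
  • SimCC combined with a CSPNeXt backbone and GAU-based refinement is competitive in accuracy with heatmap-based methods at lower computational cost (for example, compared with CT baselines or transformer-heavy baselines).
  • Two-stage training (UDP pre-training on COCO, followed by fine-tuning with strong-then-weak augmentation) together with EMA improves AP by several points, as shown in the ablation study.
  • Skip-frame detection and post-processing (OKS-based NMS and OneEuro filtering) reduce latency and make poses more stable across frames.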
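
The OKS-based NMS step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MMPose implementation: the function names, the 0.9 threshold, the three-keypoint skeleton, and the per-keypoint sigmas are hypothetical. The similarity follows the standard COCO object keypoint similarity formula, and the suppression loop is ordinary greedy NMS with OKS in place of box IoU.

```python
import numpy as np


def oks(pose_a, pose_b, area, sigmas):
    """Object keypoint similarity between two (K, 2) poses.

    Follows the COCO definition: a per-keypoint Gaussian of the squared
    distance, scaled by the object area and a per-keypoint constant.
    """
    d2 = np.sum((pose_a - pose_b) ** 2, axis=1)  # squared distance per keypoint
    k2 = (2.0 * sigmas) ** 2
    return float(np.mean(np.exp(-d2 / (2.0 * area * k2))))


def pose_nms(poses, scores, areas, sigmas, thr=0.9):
    """Greedy NMS: keep the highest-scoring pose, drop near-duplicates.

    thr is an illustrative threshold, not a value from the paper.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    for i in order:
        # Keep pose i only if it is dissimilar to every pose kept so far.
        if all(oks(poses[i], poses[j], areas[j], sigmas) < thr for j in keep):
            keep.append(int(i))
    return keep


# Hypothetical 3-keypoint skeleton with made-up sigmas.
sigmas = np.array([0.025, 0.079, 0.089])
base = np.array([[50.0, 40.0], [45.0, 80.0], [55.0, 160.0]])
poses = np.stack([
    base,          # pose 0: highest score
    base + 1.0,    # pose 1: near-duplicate of pose 0, shifted by one pixel
    base + 200.0,  # pose 2: a different person far away
])
scores = np.array([0.9, 0.8, 0.7])
areas = np.array([10000.0, 10000.0, 10000.0])

kept = pose_nms(poses, scores, areas, sigmas)
```

In this example the near-duplicate pose 1 is suppressed by pose 0, while pose 2 is kept because its OKS against pose 0 is close to zero.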
Figure 2: The overall architecture of RTMPose, which contains a convolutional layer, a fully-connected layer and a Gated Attention Unit (GAU) to refine K keypoint representations. After that 2d pose estimation is regarded as two classification tasks for x-axis and y-axis coordinates to predict the h


This review was generated by AI and reviewed by human editors.