
[Paper Review] A simple yet effective baseline for 3d human pose estimation

Julieta Martínez, Rayat Hossain | arXiv (Cornell University) | May 8, 2017
Human Pose and Action Recognition | References: 48 | Cited by: 98

One-sentence summary

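A lightweight feed-forward network lifts 2d joint positions to 3d positions in camera coordinates and achieves state-of-the-art results on the Human3.6M dataset; it still performs well when its input comes from a 2d detector.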

ABSTRACT

Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3-dimensional positions. With the goal of understanding these sources of error, we set out to build a system that given 2d joint locations predicts 3d positions. Much to our surprise, we have found that, with current technology, "lifting" ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feed-forward network outperforms the best reported result by about 30% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (i.e., using images as input) yields state of the art results; this includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggests directions to further advance the state of the art in 3d human pose estimation.

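Motivation and goals

  • Understand where the error in 3d pose estimation comes from by separating 2d pose estimation from 2d-to-3d lifting.
  • Show that a simple neural network can map 2d joints to 3d positions with low error.
  • Demonstrate state-of-the-art 3d pose accuracy on Human3.6M with both ground-truth 2d joints and detector outputs.
  • Provide a lightweight, reproducible baseline that can be extended with visual (image) evidence or more complex architectures.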

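Proposed method

  • Take 2d joint positions as input and predict 3d joint positions in the camera coordinate frame.
  • Use a deep feed-forward network built from linear layers, batch normalization, dropout, ReLU, and residual connections.
  • Rotate and translate the ground-truth 3d poses into the camera coordinate frame to stabilize learning.
  • Train with standardized inputs and outputs, and zero-center the 3d poses around the hip joint.
  • Apply a max-norm constraint to the weights to improve stability and generalization.
  • Obtain the 2d inputs from an off-the-shelf 2d detector (Stacked Hourglass); where possible, fine-tune the detector to improve results.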
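The building blocks listed above can be sketched in plain Python. This is only an illustrative inference-time forward pass, not the authors' implementation: the joint count (16), hidden width (32), number of blocks (2), the max-norm value (1.0), the per-row application of the constraint, and every function name are assumptions made for the example. Dropout is omitted because it acts as the identity at inference time, and the weights are random rather than trained.

```python
import math
import random

def linear(x, W, b):
    """Affine map y = W x + b for one sample x (a list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def batch_norm_eval(x, mean, var, eps=1e-5):
    """Inference-mode batch norm using stored running statistics."""
    return [(v - m) / math.sqrt(s + eps) for v, m, s in zip(x, mean, var)]

def clip_max_norm(W, max_norm=1.0):
    """Rescale each row of W so that its L2 norm is at most max_norm."""
    clipped = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped.append([w * scale for w in row])
    return clipped

def residual_block(x, stages):
    """(linear -> batch norm -> ReLU) stages followed by a skip connection."""
    h = x
    for W, b, mean, var in stages:
        h = relu(batch_norm_eval(linear(h, W, b), mean, var))
    return [xi + hi for xi, hi in zip(x, h)]

def build_params(n_joints=16, hidden=32, n_blocks=2, seed=0):
    """Random, untrained parameters; all sizes here are assumed values."""
    rng = random.Random(seed)

    def weights(n_out, n_in):
        W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        # In training the constraint would be re-applied after each update;
        # here it is applied once at initialization for illustration.
        return clip_max_norm(W), [0.0] * n_out

    def stage(n_out, n_in):
        W, b = weights(n_out, n_in)
        return W, b, [0.0] * n_out, [1.0] * n_out  # identity batch-norm stats

    return {
        "inp": stage(hidden, 2 * n_joints),
        "blocks": [[stage(hidden, hidden) for _ in range(2)] for _ in range(n_blocks)],
        "out": weights(3 * n_joints, hidden),
    }

def lift_2d_to_3d(pose_2d, params):
    """Map a flat list of 2d joint coordinates to a flat list of 3d coordinates."""
    W, b, mean, var = params["inp"]
    h = relu(batch_norm_eval(linear(pose_2d, W, b), mean, var))
    for stages in params["blocks"]:
        h = residual_block(h, stages)
    W_out, b_out = params["out"]
    return linear(h, W_out, b_out)

params = build_params()
pose_2d = [0.1 * i for i in range(32)]  # 16 joints, (x, y) each
pose_3d = lift_2d_to_3d(pose_2d, params)
print(len(pose_3d))  # 48 values: 16 joints, (x, y, z) each
```

The skip connection only requires that the input and output of each block have the same width, which is why the sketch projects the 2d input to the hidden width before the first block.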

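Experimental results

Research questions

  • RQ1: How accurately can 3d joints be regressed from 2d joint detections with a simple neural network architecture?
  • RQ2: How does the choice of coordinate frame (camera coordinates) affect 2d-to-3d lifting performance?
  • RQ3: How do regularization and architectural choices (batch normalization, dropout, residual connections) affect the accuracy of 2d-to-3d pose lifting?
  • RQ4: How robust is the baseline when it is given detector-generated 2d joints instead of ground-truth 2d joints?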

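Main findings

  • Trained and tested on ground-truth 2d joints, a simple deep feed-forward network reaches an error of 37.10 mm on Human3.6M, about 30% better than previous 2d-to-3d methods.
  • With 2d detections as input, the method still achieves state-of-the-art performance compared with end-to-end pixel-to-3d methods: using Stacked Hourglass (SH) detections it improves on the previous best method (Pavlakos et al.) by 4.4 mm, and the margin widens to 9.0 mm after fine-tuning the detector.
  • Residual connections, batch normalization, and dropout each reduce the error noticeably (for example, residual connections save roughly 8–10 mm; removing batch normalization or dropout raises the error by 3–8 mm).
  • Predicting 3d poses in the camera coordinate frame is essential; without camera coordinates the error exceeds 100 mm, which underlines the importance of a consistent coordinate frame.
  • The method is fast (about 3 ms per forward pass on a batch of 64 samples, about 300 fps in batch mode) and lightweight (4–5 million parameters), so it can run in real time or near real time when paired with a fast 2d detector.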
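The millimetre figures above are average per-joint position errors. A minimal sketch of such a metric follows, assuming poses are given as lists of (x, y, z) tuples in millimetres and that both poses are first centered on a root (hip) joint. The function names and the root index 0 are hypothetical choices for this example, not taken from the paper's code.

```python
import math

def center_on_root(pose, root_index=0):
    """Translate a pose so that the root (hip) joint sits at the origin."""
    rx, ry, rz = pose[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in pose]

def mean_joint_error_mm(pred, target):
    """Average Euclidean distance between corresponding 3d joints."""
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of joints")
    dists = [
        math.sqrt(sum((p - t) ** 2 for p, t in zip(pj, tj)))
        for pj, tj in zip(pred, target)
    ]
    return sum(dists) / len(dists)

# Toy two-joint example: the root matches exactly, the second joint is 5 mm off.
pred = center_on_root([(10.0, 10.0, 10.0), (13.0, 14.0, 10.0)])
target = center_on_root([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(mean_joint_error_mm(pred, target))  # 2.5
```

Centering both poses on the root removes the global translation, so the metric reflects only the relative placement of the joints.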


This review was generated by AI and checked by a human editor.