QUICK REVIEW

[Paper Review] Learning to Estimate 3D Human Pose and Shape from a Single Color Image

Georgios Pavlakos, Luyang Zhu|arXiv (Cornell University)|May 10, 2018

Human Pose and Action Recognition | References: 52 | Cited by: 22

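ONE-SENTENCE SUMMARY

This paper proposes an end-to-end deep learning framework that estimates detailed 3D human pose and shape from a single color image using the SMPL parametric body model. ConvNets predict the SMPL parameters from 2D keypoints and masks, and the pipeline is trained end to end with a differentiable renderer and a 3D per-vertex loss. Inference takes about 50 ms, the reported accuracy is state of the art, and the method is more than three orders of magnitude faster than the iterative optimization baseline.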

ABSTRACT

This work addresses the problem of estimating the full body 3D human pose and shape from a single color image. This is a task where iterative optimization-based solutions have typically prevailed, while Convolutional Networks (ConvNets) have suffered because of the lack of training data and their low resolution 3D predictions. Our work aims to bridge this gap and proposes an efficient and effective direct prediction method based on ConvNets. Central part to our approach is the incorporation of a parametric statistical body shape model (SMPL) within our end-to-end framework. This allows us to get very detailed 3D mesh results, while requiring estimation only of a small number of parameters, making it friendly for direct network prediction. Interestingly, we demonstrate that these parameters can be predicted reliably only from 2D keypoints and masks. These are typical outputs of generic 2D human analysis ConvNets, allowing us to relax the massive requirement that images with 3D shape ground truth are available for training. Simultaneously, by maintaining differentiability, at training time we generate the 3D mesh from the estimated parameters and optimize explicitly for the surface using a 3D per-vertex loss. Finally, a differentiable renderer is employed to project the 3D mesh to the image, which enables further refinement of the network, by optimizing for the consistency of the projection with 2D annotations (i.e., 2D keypoints or masks). The proposed approach outperforms previous baselines on this task and offers an attractive solution for direct prediction of 3D shape from a single color image.
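
The two training signals named in the abstract, a 3D per-vertex loss on the generated mesh and a 2D consistency term on its projection, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mean reduction, the scaled orthographic camera, and the visibility weighting are all assumptions made for illustration, and NumPy stands in for the autodiff framework that real training would require.

```python
import numpy as np


def per_vertex_loss(pred_vertices, gt_vertices):
    """Mean squared Euclidean distance between corresponding vertices.

    Both inputs have shape (N, 3). The mean reduction is an assumption.
    """
    diff = pred_vertices - gt_vertices
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def project_orthographic(points_3d, scale, translation):
    """Scaled orthographic projection (assumed camera model):
    drop the depth coordinate, then scale and shift in the image plane."""
    return scale * points_3d[:, :2] + translation


def keypoint_reprojection_loss(joints_3d, keypoints_2d, visibility,
                               scale, translation):
    """Squared 2D error between projected joints and annotated keypoints,
    averaged over the joints marked visible (visibility is 0 or 1)."""
    projected = project_orthographic(joints_3d, scale, translation)
    sq_err = np.sum((projected - keypoints_2d) ** 2, axis=1)
    return float(np.sum(visibility * sq_err) / max(np.sum(visibility), 1.0))
```

For an SMPL mesh the vertex arrays would have shape (6890, 3). Mask consistency needs an actual differentiable renderer and is left out of this sketch.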

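MOTIVATION AND OBJECTIVES

Address the challenge of estimating full-body 3D human pose and shape from a single color image, a task traditionally dominated by slow iterative optimization methods.
Use a parametric body model to overcome the limitations of ConvNets in 3D human reconstruction, namely the lack of training data and low-resolution 3D predictions.
Enable direct 3D prediction from 2D supervision alone (keypoints and masks), without requiring 3D shape annotations.
Improve training stability and accuracy with a 3D per-vertex loss and differentiable rendering, which enforces consistency with the 2D annotations.
Show that the direct prediction can serve as an effective initialization and anchor for iterative optimization methods such as SMPLify, speeding up convergence and improving the result.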

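PROPOSED METHOD

Integrate the SMPL parametric body model into an end-to-end deep learning framework, representing the 3D body with only 82 parameters (72 for pose + 10 for shape).
Train two separate networks: PosePrior regresses the SMPL pose parameters (θ) from 2D keypoint heatmaps, and ShapePrior regresses the shape parameters (β) from 2D silhouettes.
Use a differentiable renderer to project the predicted 3D mesh back into 2D image space, so that consistency with 2D keypoints and masks provides supervision.
Optimize a 3D per-vertex loss that minimizes the vertex-level error between the predicted and ground-truth 3D meshes, improving surface accuracy.
Fine-tune end to end with both 2D supervision (keypoints, masks) and 3D supervision (per-vertex loss), so that the network generalizes well to images that have no 3D shape annotations.
Use the network's predicted 3D pose as an anchor for the SMPLify optimization by adding a pose regularization term E_anchor(θ), which speeds up convergence and improves fitting quality.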
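
The parameter budget and the anchoring idea above can be sketched as follows. The 72 pose values correspond to 24 joints with 3 axis-angle components each, which is the standard SMPL convention. Everything else is an assumption: the summary does not say how the 82 values are ordered, nor does it give the form of E_anchor(θ), so the pose-then-shape layout and the squared-difference penalty below are illustrative stand-ins, and `split_smpl_params` and `anchor_energy` are hypothetical names.

```python
import numpy as np

NUM_JOINTS = 24             # SMPL joints, including the global orientation
NUM_POSE = NUM_JOINTS * 3   # 72 axis-angle values (theta)
NUM_SHAPE = 10              # shape coefficients (beta)


def split_smpl_params(params):
    """Split a flat 82-vector into theta (24, 3) and beta (10,).

    Assumes pose values come first, then shape values.
    """
    params = np.asarray(params, dtype=float)
    if params.shape != (NUM_POSE + NUM_SHAPE,):
        raise ValueError("expected a flat vector of 82 parameters")
    theta = params[:NUM_POSE].reshape(NUM_JOINTS, 3)
    beta = params[NUM_POSE:]
    return theta, beta


def anchor_energy(theta, theta_pred, weight=1.0):
    """Hypothetical stand-in for E_anchor(theta): a weighted squared
    distance between the pose being optimized and the network's pose."""
    return float(weight * np.sum((theta - theta_pred) ** 2))
```

In an SMPLify-style fit, a term like this would be added to the other energy terms so the optimizer stays close to the network's pose estimate while refining the fit to the 2D evidence.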

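EXPERIMENTAL RESULTS

Research Questions

RQ1: Can deep ConvNets directly predict detailed 3D human shape and pose from a single color image without 3D shape annotations?
RQ2: To what extent is 2D supervision (keypoints and masks) sufficient to train a network that accurately predicts 3D SMPL parameters?
RQ3: How do differentiable rendering and a 3D per-vertex loss improve the quality and generalization of 3D human reconstruction?
RQ4: Can the network's direct 3D prediction serve as an effective initialization for iterative optimization methods such as SMPLify?
RQ5: What is the trade-off between accuracy and inference speed when using a direct deep learning approach instead of iterative optimization?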

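Key Findings

The proposed method achieves state-of-the-art performance on benchmark datasets for 3D human pose and shape estimation, outperforming both direct prediction and iterative optimization baselines.
Inference takes 50 ms on a Titan X GPU, a speedup of more than three orders of magnitude over iterative SMPLify, which needs 1–3 minutes per image.
When used as an anchor for SMPLify, the direct prediction raises the segmentation F1 score to 64.62% (versus 63.98%) and makes the optimization about three times faster.
The anchored version of SMPLify reaches 92.17% foreground segmentation accuracy and a 64.62% F1 score on the LSP test set, close to the performance of SMPLify run with ground-truth 2D annotations.
Training with the 3D per-vertex loss correlates better with standard 3D evaluation metrics than naive regression on the parameters.
The framework supports end-to-end training without 3D shape ground truth, relying only on 2D keypoint and mask annotations, which significantly reduces its data requirements.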
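
The speedup figure can be checked with simple arithmetic from the timings reported above (50 ms per image for the direct prediction, 1–3 minutes per image for SMPLify). The helper names below are hypothetical, and times are given in milliseconds.

```python
import math


def speedup(baseline_ms, method_ms):
    """Ratio of baseline runtime to method runtime."""
    return baseline_ms / method_ms


def orders_of_magnitude(ratio):
    """Base-10 logarithm of a speedup ratio."""
    return math.log10(ratio)


DIRECT_MS = 50                # direct prediction, per image
SMPLIFY_LOW_MS = 60 * 1000    # 1 minute per image
SMPLIFY_HIGH_MS = 180 * 1000  # 3 minutes per image

low = speedup(SMPLIFY_LOW_MS, DIRECT_MS)    # 1200x
high = speedup(SMPLIFY_HIGH_MS, DIRECT_MS)  # 3600x
print(f"{low:.0f}x to {high:.0f}x, "
      f"i.e. 10^{orders_of_magnitude(low):.2f} to 10^{orders_of_magnitude(high):.2f}")
```

Both ends of the range exceed a factor of 1000, which is what "more than three orders of magnitude" refers to.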

This review was generated by AI and checked by a human editor.