QUICK REVIEW

[Paper Review] A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering

Shih-Yang Su, Frank Yu|arXiv (Cornell University)|Feb 11, 2021

Advanced Vision and Imaging | Cited by 35

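ONE-SENTENCE SUMMARY

A-NeRF is a self-supervised, test-time optimization method that combines a neural radiance field with an articulated skeleton embedding to refine monocular 3D human pose estimates. It reconstructs 3D body shape and pose with high fidelity from a single uncalibrated camera, without a pre-existing 3D model or ground-truth labels. The method outperforms purely discriminative approaches and generalizes well to multiple views.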

ABSTRACT

While deep learning has reshaped the classical motion capture pipeline, generative, analysis-by-synthesis elements are still in use to recover fine details if a high-quality 3D model of the user is available. Unfortunately, obtaining such a model for every user a priori is challenging, time-consuming, and limits the application scenarios. We propose a novel test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner. To this end, our approach combines the advantages of neural radiance fields with an articulated skeleton representation. Our proposed skeleton embedding serves as a common reference that links constraints across time, thereby reducing the number of required camera views from traditionally dozens of calibrated cameras, down to a single uncalibrated one. As a starting point, we employ the output of an off-the-shelf model that predicts the 3D skeleton pose. The volumetric body shape and appearance is then learned from scratch, while jointly refining the initial pose estimate. Our approach is self-supervised and does not require any additional ground truth labels for appearance, pose, or 3D shape. We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches and generalizes well to multiple views.

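MOTIVATION AND GOALS

Remove the need to obtain a high-quality 3D body model of every user in advance for monocular motion capture.
Reduce the dependence on dozens of calibrated cameras, so that accurate 3D reconstruction works with a single uncalibrated camera.
Learn volumetric body shape and appearance jointly, in a self-supervised manner, while refining the initial 3D pose estimate.
Eliminate the need for ground-truth appearance, pose, or 3D shape labels during training and inference.
Outperform purely discriminative monocular pose estimation methods in accuracy and generalization.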

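PROPOSED METHOD

The method combines a neural radiance field (NeRF) with an articulated skeleton representation to jointly optimize 3D body shape, appearance, and pose.
A skeleton embedding serves as a common reference across time, linking constraints between frames and reducing the dependence on multi-view cameras.
The method starts from the output of a pretrained, off-the-shelf 3D pose estimation network and refines it through test-time optimization.
Volumetric body shape and appearance are learned from scratch, using only monocular video and self-supervision.
The optimization is differentiable, so 3D geometry and pose can be optimized end to end from a single uncalibrated camera.
The method follows the analysis-by-synthesis principle and needs neither explicit surface supervision nor an explicit 3D model.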
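
The skeleton embedding and the volumetric rendering step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array layouts, and the choice to encode a query point as per-bone local coordinates plus distances are assumptions made for illustration, and the compositing function follows the standard NeRF quadrature rather than anything specific to this paper.

```python
import numpy as np


def skeleton_relative_embedding(point, bone_rotations, bone_positions):
    """Express a world-space query point relative to every bone.

    point:          (3,) world-space query location
    bone_rotations: (J, 3, 3) bone-to-world rotations (assumed layout)
    bone_positions: (J, 3) bone origins in world space (assumed layout)
    Returns per-bone local coordinates (J, 3) and distances (J,).
    """
    offsets = point[None, :] - bone_positions  # (J, 3)
    # World-to-bone is the transpose of bone-to-world: local = R^T @ offset.
    local = np.einsum("jki,jk->ji", bone_rotations, offsets)
    distances = np.linalg.norm(local, axis=-1)
    return local, distances


def composite_ray(densities, colors, deltas):
    """Standard NeRF-style compositing along one ray.

    densities: (N,) non-negative density per sample
    colors:    (N, 3) RGB per sample
    deltas:    (N,) distance between consecutive samples
    Returns the rendered RGB value (3,) and the per-sample weights (N,).
    """
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights


def photometric_loss(rendered, observed):
    """Mean squared error between rendered and observed pixel colors."""
    return float(np.mean((rendered - observed) ** 2))
```

In this sketch, moving the skeleton and the query point together by the same rigid transform leaves the embedding unchanged. That property is what allows observations from different frames to constrain one shared body model, while the photometric loss supplies the self-supervised signal that a differentiable framework would backpropagate to both the radiance field and the pose.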

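EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Is a single uncalibrated camera sufficient for high-fidelity 3D human body reconstruction without a pre-existing 3D model?
RQ2: How can a self-supervised method jointly optimize 3D pose and volumetric body shape from monocular video?
RQ3: Can the skeleton embedding act as a stable reference across time and reduce the need for multiple calibrated views?
RQ4: Does combining discriminative pose estimation with surface-free analysis-by-synthesis outperform purely discriminative methods?
RQ5: How well does the method generalize across multiple views without explicit multi-view supervision?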

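KEY FINDINGS

The method achieves state-of-the-art monocular 3D human pose estimation without requiring ground-truth 3D shape or appearance labels.
It reconstructs detailed 3D body shape and pose from a single uncalibrated camera, removing the need for dozens of calibrated cameras.
The skeleton embedding enables stable temporal modeling and improves reconstruction fidelity across a sequence.
The method generalizes well to multi-view settings, indicating robustness beyond the single-view case.
The self-supervised paradigm jointly optimizes shape, appearance, and pose without additional supervision.
The method outperforms purely discriminative monocular pose estimation baselines in 3D keypoint accuracy and geometric consistency.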
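
The findings above do not name the metric behind "3D keypoint accuracy". A common choice for this kind of comparison is the mean per-joint position error (MPJPE). The sketch below assumes that metric and a hypothetical (frames, joints, 3) array layout; it is an illustration, not the evaluation code used in the paper.

```python
import numpy as np


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error.

    predicted, ground_truth: (frames, joints, 3) arrays of 3D joint positions
    in the same units and coordinate frame (assumed layout).
    Returns the Euclidean joint error averaged over all frames and joints.
    """
    if predicted.shape != ground_truth.shape:
        raise ValueError("predicted and ground_truth must have the same shape")
    per_joint_error = np.linalg.norm(predicted - ground_truth, axis=-1)
    return float(per_joint_error.mean())
```

Under this assumption, a refinement step counts as an improvement when the refined poses give a lower value than the initial pose estimates on the same frames.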


This review was generated by AI and checked by a human editor.