QUICK REVIEW

[Paper Review] Self-supervised Learning of Motion Capture

Hsiao-Yu Fish Tung, Hsiao-Wei Tung|arXiv (Cornell University)|Dec 4, 2017

Advanced Vision and Imaging | References: 31 | Cited by: 131

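One-Sentence Summary

This paper proposes a learning-based motion capture model for monocular video that is pretrained on synthetic data and refined at test time through self-supervised, differentiable-rendering losses on keypoints, segmentation, and dense mesh motion, outperforming traditional optimization and non-adaptive baselines.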

ABSTRACT

Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections etc.). Optimization models are susceptible to local minima. This has been the bottleneck that forced using clean green-screen like backgrounds at capture time, manual initialization, or switching to multiple cameras as input resource. In this work, we propose a learning based motion capture model for single camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. Our model is trained using a combination of strong supervision from synthetic data, and self-supervision from differentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and (c) human-background segmentation, in an end-to-end framework. Empirically we show our model combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the model parameters in the right regime, ensuring good pose and surface initialization at test time, without manual effort. Self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers much tighter fit than a pretrained fixed model. We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.

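Motivation and Objectives

Advance monocular 3D motion capture without requiring clean backgrounds or multi-camera setups.
Develop a neural model that predicts SMPL 3D human mesh parameters from monocular video.
Use synthetic data for supervision, and adapt at test time through self-supervision based on differentiable rendering.
Show that test-time self-supervision yields tighter 3D reconstructions than purely supervised or purely optimization-based methods.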

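Proposed Method

Use SMPL as the dense 3D human mesh model, parameterized by theta (pose) and beta (shape).
Pretrain the network on synthetic data (Surreal) with supervised regression of theta and beta.
Apply self-supervised losses end to end: 3D keypoints, dense mesh motion, and segmentation are rendered differentiably and compared with their detected 2D counterparts.
The self-supervised losses are keypoint reprojection, motion reprojection compared against 2D optical flow, and a segmentation reprojection penalty based on the Chamfer distance.
Visibility is computed by ray casting so that the motion reprojection loss ignores occluded vertices; the model is trained with backpropagation.
Evaluate on Surreal and Human3.6M (H3.6M), comparing against optimization-based baselines and the pretrained-only model.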
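The three self-supervised terms listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the pinhole camera with a single `focal` parameter, the function names, and the point-set form of the Chamfer distance are all hypothetical simplifications, and the real model differentiates through these terms with respect to network weights rather than evaluating them on fixed points.

```python
import math

def project(points_3d, focal=1.0):
    """Pinhole projection of (x, y, z) points to 2D (assumed camera model)."""
    return [(focal * x / z, focal * y / z) for x, y, z in points_3d]

def keypoint_loss(joints_3d, detected_2d, focal=1.0):
    """Mean squared distance between projected 3D joints and 2D detections."""
    projected = project(joints_3d, focal)
    total = sum((px - dx) ** 2 + (py - dy) ** 2
                for (px, py), (dx, dy) in zip(projected, detected_2d))
    return total / len(projected)

def motion_loss(verts_t, verts_t1, flow_2d, visible, focal=1.0):
    """Projected vertex displacement vs. 2D optical flow, visible vertices only."""
    p0, p1 = project(verts_t, focal), project(verts_t1, focal)
    total, count = 0.0, 0
    for (x0, y0), (x1, y1), (fx, fy), vis in zip(p0, p1, flow_2d, visible):
        if not vis:
            continue  # occluded vertices are masked out of the loss
        total += (x1 - x0 - fx) ** 2 + (y1 - y0 - fy) ** 2
        count += 1
    return total / count if count else 0.0

def chamfer_loss(projected_2d, mask_points):
    """Symmetric Chamfer distance between two 2D point sets."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return one_way(projected_2d, mask_points) + one_way(mask_points, projected_2d)

# Toy usage with made-up points (not data from the paper).
joints = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
kp = keypoint_loss(joints, [(0.0, 0.0), (0.5, 0.0)])
mo = motion_loss(joints, [(0.2, 0.0, 2.0), (1.0, 0.0, 2.0)],
                 [(0.1, 0.0), (5.0, 5.0)], [True, False])
ch = chamfer_loss([(0.0, 0.0)], [(3.0, 4.0)])
```

In the toy usage, the second vertex carries a deliberately wrong flow value but is marked occluded, so it does not contribute to the motion term.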

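Experimental Results

Research Questions

RQ1: Can a neural network predict SMPL parameters from monocular video when trained on synthetic data and adapted at test time through self-supervision?
RQ2: Can losses based on differentiable rendering (keypoints, motion, segmentation) deliver accurate 3D reconstruction and domain transfer from synthetic to real data?
RQ3: In monocular motion capture, is test-time adaptation the key to outperforming purely pretrained or purely optimization-based methods?
RQ4: How do the proposed self-supervised losses complement one another in improving 3D mesh and skeleton accuracy?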

Key Findings

Surreal
Method	Surface error (mm)	Per-joint error (mm)	Reconstruction error (mm)
Optimization	346.5	532.8	1320.1
Optimization + \tilde{R}	301.1	222.0	294.9
Optimization + \tilde{R} + \tilde{T}	272.8	206.6	205.5
Pretrained	119.4	101.6	351.3
Pretrained + Self-Sup	74.5	64.4	203.9

H3.6M
Method	Per-joint error (mm)	Reconstruction error (mm)
Optimization	562.4	883.1
Pretrained	125.6	303.5
Pretrained + Self-Sup	98.4	145.8
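As a reading aid, the gain from test-time self-supervision can be expressed as a relative error reduction of the pretrained + self-supervised rows over the pretrained rows. The short script below only does arithmetic on the table values above; the helper name `relative_reduction` and the dictionary layout are illustrative choices, not anything from the paper.

```python
def relative_reduction(before_mm, after_mm):
    """Percentage by which the error drops from before_mm to after_mm."""
    return 100.0 * (before_mm - after_mm) / before_mm

# (pretrained, pretrained + self-supervised) errors in mm, copied from the tables.
results = {
    ("Surreal", "surface"): (119.4, 74.5),
    ("Surreal", "per-joint"): (101.6, 64.4),
    ("Surreal", "reconstruction"): (351.3, 203.9),
    ("H3.6M", "per-joint"): (125.6, 98.4),
    ("H3.6M", "reconstruction"): (303.5, 145.8),
}

reductions = {key: round(relative_reduction(*pair), 1)
              for key, pair in results.items()}

for (dataset, metric), pct in reductions.items():
    print(f"{dataset} {metric}: {pct}% lower")
```

Under this measure, self-supervision reduces each reported error by roughly 22% to 52% relative to the pretrained model.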

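Self-supervised test-time adaptation achieves higher 3D reconstruction accuracy than the pretrained-only model or the direct optimization baselines.
On Surreal, the pretrained + self-supervised model reaches a surface error of 74.5 mm, a per-joint error of 64.4 mm, and a reconstruction error of 203.9 mm, the lowest values in each column.
On H3.6M, the pretrained + self-supervised model lowers per-joint error to 98.4 mm and reconstruction error to 145.8 mm, versus 125.6 mm and 303.5 mm for the pretrained model and 562.4 mm and 883.1 mm for the optimization baseline.
Ablation analysis shows that the three losses (keypoints, segmentation, motion) are complementary and jointly improve 3D keypoint and mesh accuracy.
Self-supervision through differentiable rendering enables domain transfer from synthetic data (Surreal) to real data (H3.6M) and yields a better fit.
The method combines supervised pretraining with unsupervised adaptation, achieving a tighter mesh fit without manual initialization.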
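The idea of adapting model weights, rather than pose parameters, on unlabeled test data can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: the "network" is a two-weight linear map from a scalar frame feature to one joint angle, the "renderer" places a unit-length limb endpoint at (cos theta, sin theta), and plain gradient descent stands in for backpropagation. It is a sketch of the principle, not the paper's model, SMPL, or renderer.

```python
import math

def predict_angle(w, b, feature):
    """Hypothetical one-layer 'network': frame feature -> joint angle."""
    return w * feature + b

def reprojection_loss(w, b, features, observed_2d):
    """Mean squared 2D error between the rendered endpoint and its detection."""
    total = 0.0
    for f, (u, v) in zip(features, observed_2d):
        theta = predict_angle(w, b, f)
        total += (math.cos(theta) - u) ** 2 + (math.sin(theta) - v) ** 2
    return total / len(features)

def adapt(w, b, features, observed_2d, lr=0.1, steps=200):
    """Test-time adaptation: gradient descent on the weights, no 3D labels."""
    n = len(features)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for f, (u, v) in zip(features, observed_2d):
            theta = predict_angle(w, b, f)
            # Derivative of the per-frame loss with respect to theta.
            d_theta = (2 * (math.cos(theta) - u) * (-math.sin(theta))
                       + 2 * (math.sin(theta) - v) * math.cos(theta))
            grad_w += d_theta * f / n  # chain rule: d(theta)/dw = feature
            grad_b += d_theta / n      # chain rule: d(theta)/db = 1
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic test sequence: the true mapping is theta = 1.0 * feature + 0.0.
features = [0.0, 0.5, 1.0]
observed = [(math.cos(f), math.sin(f)) for f in features]

w0, b0 = 0.8, 0.1  # stand-in for slightly wrong "pretrained" weights
before = reprojection_loss(w0, b0, features, observed)
w1, b1 = adapt(w0, b0, features, observed)
after = reprojection_loss(w1, b1, features, observed)
```

The pretrained weights start the search close to a good solution, and the 2D reprojection loss then tightens the fit on the test sequence, which mirrors the division of labor the findings describe.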


This review was generated by AI and reviewed by a human editor.