QUICK REVIEW

[Paper Review] Sim2real transfer learning for 3D human pose estimation: motion to the rescue

Carl Doersch, Andrew Zisserman | arXiv (Cornell University) | Jul 4, 2019
Human Pose and Action Recognition | Cited by 71
One-Sentence Summary

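This paper shows that feeding motion cues (optical flow and 2D keypoints) as input to a motion-augmented pose estimator substantially improves sim2real transfer for 3D human pose estimation, reaching performance close to state-of-the-art methods while training only on synthetic data.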

ABSTRACT

Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.

Research Motivation and Goals

  • Motivate and address the sim2real gap in 3D human pose estimation using synthetic data.
  • Propose motion-based preprocessing (optical flow and 2D keypoints) to bridge the domain gap.
  • Demonstrate that motion cues enable competitive 3D pose performance on real-world data using synthetic training.
  • Construct a synthetic video pipeline with realistic motion and occlusions to train a pose estimator.
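The compositing idea behind such a synthetic video pipeline can be illustrated with plain alpha blending: a rendered person is pasted over a real background frame, and an occluder is then pasted over both. This is a minimal sketch, not the authors' pipeline; the function names, the mask convention (values in [0, 1], shape H x W), and the toy data are assumptions made for illustration.

```python
import numpy as np

def composite(background, foreground, alpha):
    """Alpha-blend foreground over background.

    background, foreground: float arrays of shape (H, W, C).
    alpha: float array of shape (H, W) with values in [0, 1].
    """
    a = alpha[..., None]  # broadcast the mask over the channel axis
    return a * foreground + (1.0 - a) * background

def composite_frame(background, person, person_mask, occluder, occluder_mask):
    """Paste the person over the background, then the occluder over both."""
    frame = composite(background, person, person_mask)
    return composite(frame, occluder, occluder_mask)

# Toy 4x4 single-channel example (hypothetical data).
bg = np.zeros((4, 4, 1), dtype=np.float32)
person = np.ones((4, 4, 1), dtype=np.float32)
person_mask = np.zeros((4, 4), dtype=np.float32)
person_mask[1:3, 1:3] = 1.0           # person covers the central 2x2 block
occluder = np.full((4, 4, 1), 0.5, dtype=np.float32)
occluder_mask = np.zeros((4, 4), dtype=np.float32)
occluder_mask[2:, :] = 1.0            # occluder covers the bottom two rows

frame = composite_frame(bg, person, person_mask, occluder, occluder_mask)
```

Applying the occluder last means it hides part of the person, which is the effect an occlusion-augmented training set is meant to reproduce.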

Proposed Method

  • Extend Human Mesh Recovery (HMR) to handle video with a memory-enabled (LSTM) component called Motion HMR.
  • Preprocess inputs with optical flow (FlowNet) and 2D keypoint heatmaps; concatenate these as additional input channels.
  • Create a synthetic training dataset by compositing SURREAL characters onto real backgrounds with motion, occlusions, and camera motion; include an occlusion generation pipeline using SLIC superpixels.
  • Train end-to-end with a simplified loss from Kinetics pseudo ground truth (Procrustes-aligned 3D keypoint location error and 2D reprojection error).
  • Compare RGB-only, flow-only, keypoints-only, and combinations, and evaluate on 3DPW using PA-MPJPE.
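The input variants compared above (RGB, flow, keypoints, and combinations) amount to choosing which per-pixel cues are stacked along the channel axis. The following is a minimal sketch under stated assumptions: keypoints are rendered as Gaussian heatmaps with a hypothetical `sigma`, flow is a two-channel (dx, dy) field, and the helper names are invented here; the paper's actual heatmap format and network input layout may differ.

```python
import numpy as np

def keypoint_heatmaps(keypoints_xy, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D keypoint; returns (H, W, K)."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        for x, y in keypoints_xy
    ]
    return np.stack(maps, axis=-1).astype(np.float32)

def stack_inputs(rgb=None, flow=None, heatmaps=None):
    """Concatenate whichever cues are provided along the channel axis."""
    parts = [p for p in (rgb, flow, heatmaps) if p is not None]
    if not parts:
        raise ValueError("at least one input cue is required")
    return np.concatenate(parts, axis=-1)

H, W = 64, 64
rgb = np.zeros((H, W, 3), dtype=np.float32)    # placeholder image
flow = np.zeros((H, W, 2), dtype=np.float32)   # placeholder (dx, dy) flow
kps = [(10.0, 20.0), (32.0, 40.0)]             # hypothetical (x, y) keypoints
hm = keypoint_heatmaps(kps, H, W)

flow_and_keypoints = stack_inputs(flow=flow, heatmaps=hm)  # no RGB channels
all_cues = stack_inputs(rgb=rgb, flow=flow, heatmaps=hm)
```

Dropping the `rgb` argument gives the motion-only variants, so the same downstream network can be trained on each combination by changing only its input channel count.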

Experimental Results

Research Questions

  • RQ1: Can motion-based cues bridge the sim2real gap for 3D human pose estimation from synthetic data?
  • RQ2: How do optical flow and 2D keypoints, individually and together, affect transfer performance to real-world data?
  • RQ3: Does adding motion information to a baseline pose estimator outperform domain-adversarial approaches like DANN for sim2real transfer?
  • RQ4: What is the impact of synthetic dataset construction details (motion-rich backgrounds, occlusions) on transfer performance?
  • RQ5: What is the effect of temporal context length on pose estimation accuracy in this setting?

Key Findings

  • Motion-based inputs substantially improve sim2real transfer over RGB-only training, with Flow Only achieving 100.1 PA-MPJPE and RGB+Keypoints achieving 82.4 (in mm; lower is better).
  • Keypoints Only and Flow+Keypoints achieve the best transfer, with 77.6 and 74.7 PA-MPJPE respectively on 3DPW.
  • Adding RGB to flow or keypoints (RGB+Flow, RGB+Keypoints) underperforms the corresponding inputs without RGB, indicating that RGB textures lead to overfitting to synthetic appearance.
  • Training on synthetic data with motion cues and occlusion/background realism yields performance competitive with state-of-the-art methods trained on real data (e.g., HMR variants).
  • DANN provides only marginal gains in this setting, suggesting domain-adversarial training is less effective than explicit motion cues for this task.
  • Ablations show that the full motion pipeline with occlusions and moving backgrounds provides a notable boost over static-background baselines.
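For readers unfamiliar with the metric behind these numbers, PA-MPJPE is the mean per-joint position error after the prediction has been rigidly aligned (rotation, translation, uniform scale) to the ground truth. The sketch below uses the standard SVD-based Procrustes solution and is not the benchmark's official evaluation code; the function names and the toy pose are assumptions, and degenerate inputs (for example, all joints at the same point) are not handled.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3) using row-vector convention."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(3)
    if np.linalg.det(u @ vt) < 0:
        d[-1] = -1.0  # flip the last axis to avoid a reflection
    rot = u @ np.diag(d) @ vt
    scale = (s * d).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """Mean Euclidean joint error after Procrustes alignment (same units as input)."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=1).mean()

# Toy check: a scaled, rotated, and shifted copy of the pose should score ~0.
rng = np.random.default_rng(0)
gt = rng.normal(size=(14, 3))
theta = np.pi / 6
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
pred = 2.5 * gt @ rz + np.array([1.0, -2.0, 0.5])
err_same_pose = pa_mpjpe(pred, gt)

# Moving a single joint changes the pose itself, so the error becomes positive.
noisy = pred.copy()
noisy[0] += np.array([0.5, 0.0, 0.0])
err_noisy = pa_mpjpe(noisy, gt)
```

Because the alignment removes global rotation, translation, and scale, the metric measures only errors in the articulated pose, not errors in where or how large the person is in the scene.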


This review was generated by AI and reviewed by human editors.