QUICK REVIEW

[Paper Review] Unsupervised Training for 3D Morphable Model Regression

Kyle Genova, Forrester Cole|arXiv (Cornell University)|Jun 15, 2018
Face recognition and analysis | 27 references | 18 citations
TL;DR

This paper proposes an unsupervised method for training a deep regression network that maps a single image to 3D morphable model (3DMM) parameters using only unlabeled photographs. It relies on identity features from a pre-trained face recognition network and introduces three novel losses: a batch distribution loss, a loopback loss, and a multi-view identity loss. Without any ground-truth 3D supervision, the model reports state-of-the-art 3D face reconstruction accuracy and produces recognizable, identity-preserving 3D faces even from challenging images.

ABSTRACT

We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.

Motivation & Objective

  • To address the lack of large-scale, real-world 3D face supervision for training deep regression networks.
  • To enable accurate 3D face reconstruction from single images without requiring ground-truth 3D scans or inverse rendering.
  • To improve generalization and identity preservation in 3D face generation by leveraging robust, pose- and lighting-invariant identity features.
  • To eliminate reliance on synthetic data or iterative optimization by using an unsupervised loss based on deep identity embeddings.

Proposed method

  • The method trains a regression network to predict 3DMM shape and texture parameters from image pixels using only unlabeled images and a pre-trained face recognition network.
  • A differentiable renderer generates synthetic face images from predicted 3DMM parameters, enabling backpropagation through the rendering process.
  • The identity loss compares VGG-Face or FaceNet features between the input image and the rendered 3D face, ensuring identity consistency across varying poses and lighting.
  • The batch distribution loss matches the statistical distribution of predicted 3DMM parameters to the prior distribution of the morphable model, preventing mode collapse.
  • The loopback loss ensures the network can correctly reinterpret its own output: the rendered prediction is fed back through the regression network, which should recover the same 3DMM parameters that produced the rendering.
  • The multi-view identity loss enhances robustness by computing identity features from multiple independent views of the predicted 3D face and comparing them to the input image's features.
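The loss terms listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that identity features arrive as precomputed `(batch, dim)` arrays, that the 3DMM prior is a standard normal per coordinate, and that the loopback term is a mean squared error between first-pass parameters and parameters re-predicted from the rendered face. All function names are hypothetical, and the renderer and networks are left out.

```python
import numpy as np


def cosine_identity_loss(feat_input, feat_render):
    """Mean (1 - cosine similarity) between matching rows of two feature arrays."""
    a = feat_input / np.linalg.norm(feat_input, axis=1, keepdims=True)
    b = feat_render / np.linalg.norm(feat_render, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))


def multi_view_identity_loss(feat_input, feats_per_view):
    """Average the identity loss over features from several rendered views."""
    losses = [cosine_identity_loss(feat_input, f) for f in feats_per_view]
    return float(np.mean(losses))


def batch_distribution_loss(params):
    """Push per-coordinate batch statistics toward zero mean and unit variance.

    Assumption: the morphable model prior is N(0, 1) in each coordinate.
    """
    mean = params.mean(axis=0)
    var = params.var(axis=0)
    return float(np.sum(mean ** 2) + np.sum((var - 1.0) ** 2))


def loopback_loss(params_first_pass, params_second_pass):
    """Assumed form: MSE between predicted and re-predicted 3DMM parameters."""
    return float(np.mean((params_first_pass - params_second_pass) ** 2))
```

In this sketch a batch whose predictions all collapse to the same parameter vector has zero variance, so `batch_distribution_loss` stays large. This is the sense in which such a term discourages mode collapse.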

Experimental results

Research questions

  • RQ1: Can a 3D face reconstruction network be trained without any 3D supervision or synthetic data?
  • RQ2: How can identity consistency be preserved in 3D face reconstruction when the input image varies in pose, lighting, and expression?
  • RQ3: What loss functions are effective for unsupervised 3DMM regression that avoid network fooling and mode collapse?
  • RQ4: Can a regression network trained on unlabeled images achieve performance comparable to or better than supervised methods?
  • RQ5: How robust is the method to challenging real-world conditions such as blur, occlusion, and non-photorealistic inputs?

Key findings

  • On the MoFA-Test dataset, the method achieves a Top-1 identity recall of 87% using VGG-Face features, significantly outperforming MoFA (19%) and Tran et al. (25%).
  • On the larger LFW dataset with 5,749 identities, the method achieves a Top-5 identity recall of 51%, demonstrating strong generalization to diverse identities.
  • The Earth mover's distance (EMD) between the similarity-score distribution of reconstructed faces and that of real same-identity pairs on LFW was 0.16, indicating that rendered reconstructions score close to real photographs of the same person.
  • The method produces consistent, recognizable 3D faces even from non-photorealistic artwork, as shown on the BAM dataset, due to the invariance of identity features to stylized pixel details.
  • The model is robust to pose, lighting, expression, occlusion, and blur, as demonstrated on the FERET stress test set.
  • The unsupervised training scheme, combining identity, loopback, and batch distribution losses, successfully avoids mode collapse and network fooling, leading to high-quality 3D reconstructions.
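The two metrics quoted above, Top-k identity recall and the Earth mover's distance between score distributions, can be sketched as follows. This illustrates the general metrics and does not reproduce the paper's evaluation protocol. It assumes features are precomputed arrays, ranks gallery entries by cosine similarity, and uses the one-dimensional EMD for two equal-sized samples, which reduces to the mean absolute difference of the sorted values. Function names are hypothetical.

```python
import numpy as np


def top_k_recall(query_feats, query_ids, gallery_feats, gallery_ids, k=1):
    """Fraction of queries whose true identity is among the k most similar gallery entries."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                                # cosine similarity matrix
    top = np.argsort(-sims, axis=1)[:, :k]        # indices of the k best matches
    gallery_ids = np.asarray(gallery_ids)
    hits = [np.any(gallery_ids[top[i]] == query_ids[i]) for i in range(len(query_ids))]
    return float(np.mean(hits))


def emd_1d(scores_a, scores_b):
    """1D Earth mover's distance between two equal-sized samples of scores."""
    a = np.sort(np.asarray(scores_a, dtype=float))
    b = np.sort(np.asarray(scores_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal-sized samples")
    return float(np.mean(np.abs(a - b)))
```

Under these definitions, a recall of 87% at k=1 means that for 87% of queries the single most similar gallery entry has the correct identity. A smaller EMD means the two score distributions overlap more closely.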

This review was created by AI and reviewed by human editors.