QUICK REVIEW

[Paper Review] Unsupervised Training for 3D Morphable Model Regression

Kyle Genova, Forrester Cole|arXiv (Cornell University)|Jun 15, 2018

Face recognition and analysis | 27 references | 18 citations

TL;DR

This paper proposes an unsupervised method to train a deep regression network that maps single images to 3D morphable model (3DMM) parameters using only unlabeled photographs. It leverages identity features from a pre-trained face recognition network and introduces three novel losses: a batch distribution loss, a loopback loss, and a multi-view identity loss. The resulting model achieves state-of-the-art 3D face reconstruction accuracy without any ground-truth 3D supervision, producing recognizable, identity-preserving 3D faces even from challenging images.

ABSTRACT

We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.

Motivation & Objective

To address the lack of large-scale, real-world 3D face supervision for training deep regression networks.
To enable accurate 3D face reconstruction from single images without requiring ground-truth 3D scans or inverse rendering.
To improve generalization and identity preservation in 3D face generation by leveraging robust, pose- and lighting-invariant identity features.
To eliminate reliance on synthetic data or iterative optimization by using an unsupervised loss based on deep identity embeddings.

Proposed method

The method trains a regression network to predict 3DMM shape and texture parameters from image pixels using only unlabeled images and a pre-trained face recognition network.
A differentiable renderer generates synthetic face images from predicted 3DMM parameters, enabling backpropagation through the rendering process.
The identity loss compares VGG-Face or FaceNet features between the input image and the rendered 3D face, ensuring identity consistency across varying poses and lighting.
The batch distribution loss matches the statistical distribution of predicted 3DMM parameters to the prior distribution of the morphable model, preventing mode collapse.
The loopback loss ensures the network can correctly reinterpret its own output: a rendering of the predicted 3D face is passed back through the network, and the result is required to be consistent with the original prediction.
The multi-view identity loss enhances robustness by computing identity features from multiple independent views of the predicted 3D face and comparing them to the input image's features.
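
The losses listed above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that identity features are compared by cosine similarity, that the morphable model prior is zero-mean and unit-variance per coefficient, and that the loopback term compares two parameter vectors with a mean squared error. All function names are hypothetical, and the renderer and recognition network are left out, so the functions take precomputed feature and parameter arrays.

```python
import numpy as np


def identity_loss(feat_input, feat_render):
    """1 - cosine similarity between two identity feature vectors (assumed form)."""
    a = feat_input / np.linalg.norm(feat_input)
    b = feat_render / np.linalg.norm(feat_render)
    return 1.0 - float(a @ b)


def multi_view_identity_loss(feat_input, feats_views):
    """Average identity loss over features of several rendered views."""
    return float(np.mean([identity_loss(feat_input, f) for f in feats_views]))


def batch_distribution_loss(params):
    """Penalize batch statistics that drift from an assumed N(0, 1) prior.

    params: (batch, dims) array of predicted 3DMM coefficients.
    """
    mean = params.mean(axis=0)
    var = params.var(axis=0)  # population variance over the batch
    return float(np.sum(mean ** 2) + np.sum((var - 1.0) ** 2))


def loopback_loss(params_first, params_second):
    """Mean squared difference between a prediction and its re-prediction."""
    return float(np.mean((params_first - params_second) ** 2))
```

In a real training loop these terms would be computed in an autodiff framework and summed with weights. The weights are not given in this review, so none are suggested here.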

Experimental results

Research questions

RQ1: Can a 3D face reconstruction network be trained without any 3D supervision or synthetic data?
RQ2: How can identity consistency be preserved in 3D face reconstruction when the input image varies in pose, lighting, and expression?
RQ3: What loss functions are effective for unsupervised 3DMM regression that avoid network fooling and mode collapse?
RQ4: Can a regression network trained on unlabeled images achieve performance comparable to or better than supervised methods?
RQ5: How robust is the method to challenging real-world conditions such as blur, occlusion, and non-photorealistic inputs?

Key findings

On the MoFA-Test dataset, the method achieves a Top-1 identity recall of 87% using VGG-Face features, significantly outperforming MoFA (19%) and Tran et al. (25%).
On the larger LFW dataset with 5,749 identities, the method achieves a Top-5 identity recall of 51%, demonstrating strong generalization to diverse identities.
The Earth mover’s distance (EMD) between the similarity scores of reconstructed faces and real same-identity pairs on LFW was 0.16, indicating high similarity to real identities.
The method produces consistent, recognizable 3D faces even from non-photorealistic artwork, as shown on the BAM dataset, due to the invariance of identity features to stylized pixel details.
The model is robust to pose, lighting, expression, occlusion, and blur, as demonstrated on the FERET stress test set.
The unsupervised training scheme, combining identity, loopback, and batch distribution losses, successfully avoids mode collapse and network fooling, leading to high-quality 3D reconstructions.
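
The two metrics quoted above, Top-k identity recall and Earth mover's distance, can be sketched as follows. This is a generic formulation and not the paper's evaluation code. It assumes that recall is measured by cosine-similarity retrieval against a labeled gallery, and that the EMD is taken between two equal-sized sets of one-dimensional similarity scores. The function names and array layouts are hypothetical.

```python
import numpy as np


def top_k_recall(query_feats, query_ids, gallery_feats, gallery_ids, k):
    """Fraction of queries whose identity appears among the k nearest gallery items."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T  # cosine similarities, shape (n_query, n_gallery)
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar items
    hits = [query_ids[i] in gallery_ids[top[i]] for i in range(len(query_ids))]
    return float(np.mean(hits))


def emd_1d(scores_a, scores_b):
    """Earth mover's distance between two equal-sized 1D samples.

    For equal-sized samples this reduces to the mean absolute difference
    between the sorted values.
    """
    a = np.sort(np.asarray(scores_a, dtype=float))
    b = np.sort(np.asarray(scores_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch only handles equal-sized samples")
    return float(np.mean(np.abs(a - b)))
```

Under this reading, a lower EMD means the similarity scores between reconstructions and photographs are distributed more like the scores between real photographs of the same person.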

This review was created by AI and reviewed by human editors.