[Paper Review] Unsupervised Training for 3D Morphable Model Regression
This paper proposes an unsupervised method for training a deep regression network that maps a single image to 3D morphable model (3DMM) parameters using only unlabeled photographs. The training signal comes from identity features of a pre-trained face recognition network, combined with three new losses (batch distribution, loopback, and multi-view identity). The resulting model reports state-of-the-art 3D face reconstruction results without any ground-truth 3D supervision, producing recognizable, identity-preserving 3D faces even from challenging images.
We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.
Motivation & Objective
- To address the lack of large-scale, real-world 3D face supervision for training deep regression networks.
- To enable accurate 3D face reconstruction from single images without requiring ground-truth 3D scans or inverse rendering.
- To improve generalization and identity preservation in 3D face generation by leveraging robust, pose- and lighting-invariant identity features.
- To eliminate reliance on synthetic data or iterative optimization by using an unsupervised loss based on deep identity embeddings.
Proposed method
- The method trains a regression network to predict 3DMM shape and texture parameters from image pixels using only unlabeled images and a pre-trained face recognition network.
- A differentiable renderer generates synthetic face images from predicted 3DMM parameters, enabling backpropagation through the rendering process.
- The identity loss compares VGG-Face or FaceNet features between the input image and the rendered 3D face, ensuring identity consistency across varying poses and lighting.
- The batch distribution loss matches the statistical distribution of predicted 3DMM parameters to the prior distribution of the morphable model, preventing mode collapse.
- The loopback loss ensures the network can correctly reinterpret its own output by re-encoding the predicted 3D face and reconstructing the same identity features.
- The multi-view identity loss enhances robustness by computing identity features from multiple independent views of the predicted 3D face and comparing them to the input image's features.
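The identity, multi-view identity, and batch distribution losses listed above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: it assumes the identity loss is one minus cosine similarity between embeddings, and that 3DMM coefficients are scaled so the model prior has zero mean and unit variance in every dimension. The function names are hypothetical, and plain lists stand in for the outputs of the face recognition network and the regression network. The loopback loss is left out because it needs the renderer and the network in the loop.

```python
import math
from statistics import fmean, pvariance


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_loss(input_features, rendered_features):
    """Assumed form: 1 - cosine similarity (0 when embeddings align)."""
    return 1.0 - cosine_similarity(input_features, rendered_features)


def multi_view_identity_loss(input_features, rendered_views):
    """Average the identity loss over embeddings of several rendered views."""
    return fmean(identity_loss(input_features, v) for v in rendered_views)


def batch_distribution_loss(batch_params):
    """Penalize per-dimension batch mean != 0 and batch variance != 1.

    Assumption: coefficients are normalized so the morphable model prior
    is zero-mean, unit-variance in each dimension.
    """
    dims = list(zip(*batch_params))  # transpose: one tuple per coefficient
    mean_term = sum(fmean(d) ** 2 for d in dims)
    var_term = sum((pvariance(d) - 1.0) ** 2 for d in dims)
    return mean_term + var_term
```

A batch in which every prediction is identical has zero variance, so the variance term grows; that is the sense in which matching the prior discourages the network from collapsing to a single face.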
Experimental results
Research questions
- RQ1: Can a 3D face reconstruction network be trained without any 3D supervision or synthetic data?
- RQ2: How can identity consistency be preserved in 3D face reconstruction when the input image varies in pose, lighting, and expression?
- RQ3: What loss functions are effective for unsupervised 3DMM regression that avoid network fooling and mode collapse?
- RQ4: Can a regression network trained on unlabeled images achieve performance comparable to or better than supervised methods?
- RQ5: How robust is the method to challenging real-world conditions such as blur, occlusion, and non-photorealistic inputs?
Key findings
- On the MoFA-Test dataset, the method achieves a Top-1 identity recall of 87% using VGG-Face features, significantly outperforming MoFA (19%) and Tran et al. (25%).
- On the larger LFW dataset with 5,749 identities, the method achieves a Top-5 identity recall of 51%, demonstrating strong generalization to diverse identities.
- The Earth mover’s distance (EMD) between the similarity-score distribution of reconstructed faces and that of real same-identity pairs on LFW is 0.16; a lower EMD means the reconstructions score more like genuine same-person photographs.
- The method produces consistent, recognizable 3D faces even from non-photorealistic artwork, as shown on the BAM dataset, due to the invariance of identity features to stylized pixel details.
- The model is robust to pose, lighting, expression, occlusion, and blur, as demonstrated on the FERET stress test set.
- The unsupervised training scheme, combining identity, loopback, and batch distribution losses, successfully avoids mode collapse and network fooling, leading to high-quality 3D reconstructions.
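The two metrics quoted above, Top-k identity recall and EMD between similarity-score distributions, can be illustrated with a small sketch. The evaluation protocol here is an assumption, not taken from the paper: queries are embeddings of rendered reconstructions, the gallery holds embeddings of real photographs, ranking uses cosine similarity, and the EMD helper handles only two equal-size one-dimensional samples. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_recall(queries, gallery, k):
    """Fraction of queries whose true identity is among the k best matches.

    queries: list of (identity, embedding) for rendered reconstructions
    gallery: list of (identity, embedding) for real photographs
    """
    hits = 0
    for true_id, query in queries:
        ranked = sorted(
            gallery,
            key=lambda item: cosine_similarity(query, item[1]),
            reverse=True,
        )
        top_ids = [identity for identity, _ in ranked[:k]]
        hits += true_id in top_ids
    return hits / len(queries)


def emd_1d(scores_a, scores_b):
    """Earth mover's distance between two equal-size 1D samples.

    For equal sample counts this reduces to the mean absolute difference
    between the sorted values.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(scores_a), sorted(scores_b))
    return sum(abs(x - y) for x, y in pairs) / len(scores_a)
```

For samples of different sizes, a general routine such as SciPy's `wasserstein_distance` would be the usual choice; the equal-size shortcut is used here only to keep the sketch self-contained.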
This review was created by AI and reviewed by human editors.