[Paper Review] Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set
This paper presents a CNN-based framework for weakly-supervised 3D face reconstruction from a single image using a hybrid image- and perception-level loss, and introduces a confidence-based aggregation network to fuse multiple images for improved 3D shape reconstruction.
Recently, deep learning based 3D face reconstruction methods have shown promising results in both quality and efficiency. However, training deep neural networks typically requires a large volume of data, whereas face images with ground-truth 3D face shapes are scarce. In this paper, we propose a novel deep 3D face reconstruction approach that 1) leverages a robust, hybrid loss function for weakly-supervised learning which takes into account both low-level and perception-level information for supervision, and 2) performs multi-image face reconstruction by exploiting complementary information from different images for shape aggregation. Our method is fast, accurate, and robust to occlusion and large pose. We provide comprehensive experiments on three datasets, systematically comparing our method with fifteen recent methods and demonstrating its state-of-the-art performance.
Motivation & Objective
- Achieve accurate 3D face reconstruction without ground-truth 3D labels by leveraging weak supervision signals such as landmarks, skin masks, and face recognition features.
- Develop a hybrid-level loss that combines low-level photometric information with perception-level (deep feature) supervision to guide learning.
- Propose a skin-color based photometric attention mechanism to improve robustness to occlusion and appearance variations.
- Enable multi-image reconstruction by learning per-coefficient confidence scores to aggregate 3DMM coefficients across an image set.
- Demonstrate state-of-the-art performance on multiple datasets and show fast inference.
Proposed method
- Use a CNN (R-Net) to regress 3D Morphable Model coefficients, illumination, and pose from a single image.
- Train with a hybrid loss: image-level photometric loss with a skin attention mask, a landmark loss, a perception-level loss using a pre-trained face recognition network, and regularization terms on 3DMM coefficients and texture variance.
- Introduce a skin-attention mechanism computed from a naive Bayes skin classifier to weight pixel discrepancies.
- In multi-image settings, learn an auxiliary network (C-Net) to output per-coefficient confidence scores for aggregation, enabling element-wise coefficient fusion across images.
- Aggregate identity coefficients across images as a weighted mean using predicted confidences, allowing pose and lighting diversity to enhance reconstruction.
- Train C-Net without labels: evaluate the same hybrid losses on reconstructions built from the aggregated coefficients and backpropagate the resulting error.
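The two weighting steps in the list above (skin attention on the photometric term, and per-coefficient confidence fusion across an image set) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the flat list-of-pixels layout, the use of raw skin probabilities as attention weights, and the `eps` stabilizer are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def skin_weighted_photo_loss(
    rendered: Sequence[Pixel],
    target: Sequence[Pixel],
    skin_prob: Sequence[float],
    eps: float = 1e-8,  # assumption: small constant to avoid division by zero
) -> float:
    """Attention-weighted mean of per-pixel colour distances.

    Pixels with low skin probability (e.g. occluders, hair) contribute less.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, q, a in zip(rendered, target, skin_prob):
        dist = math.sqrt(sum((pc - qc) ** 2 for pc, qc in zip(p, q)))
        weighted_sum += a * dist
        weight_total += a
    return weighted_sum / (weight_total + eps)


def aggregate_coefficients(
    coeffs: Sequence[Sequence[float]],
    confidences: Sequence[Sequence[float]],
    eps: float = 1e-8,  # assumption: same stabilizer as above
) -> List[float]:
    """Element-wise confidence-weighted mean over an image set.

    coeffs[j][k]      : k-th identity coefficient predicted from image j
    confidences[j][k] : non-negative confidence for that coefficient
    """
    if not coeffs or len(coeffs) != len(confidences):
        raise ValueError("need one confidence vector per coefficient vector")
    dim = len(coeffs[0])
    if any(len(v) != dim for v in list(coeffs) + list(confidences)):
        raise ValueError("all vectors must have the same length")

    fused = []
    for k in range(dim):
        weight_total = sum(c[k] for c in confidences)
        weighted_sum = sum(c[k] * x[k] for x, c in zip(coeffs, confidences))
        fused.append(weighted_sum / (weight_total + eps))
    return fused


# Two images, two coefficients: the second image is trusted more
# for the second coefficient only.
fused = aggregate_coefficients(
    coeffs=[[1.0, 0.0], [3.0, 4.0]],
    confidences=[[1.0, 1.0], [1.0, 3.0]],
)

# The mismatched second pixel is ignored because its skin probability is 0.
masked_loss = skin_weighted_photo_loss(
    rendered=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    target=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    skin_prob=[1.0, 0.0],
)
```

When every confidence is equal, `aggregate_coefficients` reduces to plain averaging of the coefficients, which is the baseline that learned confidences are meant to improve on.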
Experimental results
Research questions
- RQ1: Can a hybrid image- and perception-level loss improve weakly-supervised 3D face reconstruction from a single image without ground-truth 3D shapes?
- RQ2: Does a skin-color based photometric attention improve robustness to occlusion and appearance variation in 3D reconstruction?
- RQ3: Can an auxiliary network predict per-coefficient confidences to effectively aggregate multiple face images for a more accurate 3D shape?
- RQ4: Does multi-image aggregation using learned confidences outperform naive averaging or global quality scores in unconstrained image sets?
- RQ5: How does the proposed method compare to state-of-the-art supervised and unsupervised/weakly-supervised approaches across standard datasets?
Key findings
- Single-image reconstruction with the proposed hybrid losses achieves state-of-the-art accuracy on MICC and FaceWarehouse datasets.
- Joint image-level and perception-level supervision outperforms using either signal alone.
- Skin attention improves robustness to occlusion and challenging appearances (e.g., beards, makeup).
- Multi-image aggregation with element-wise confidence-based coefficient fusion yields better 3D reconstructions than shape averaging and other strategies, approaching supervised performance.
- Across datasets, the method demonstrates robustness to occlusion and large pose, with fast inference times (notably 20 ms per forward pass in certain settings).
- C-Net (the confidence network) effectively learns to emphasize high-quality, high-visibility images and can leverage pose differences to improve fusion.
This review was created by AI and reviewed by human editors.