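[Paper Walkthrough] Fine-Grained 3D Facial Reconstruction for Micro-Expressions
Proposes a coarse-to-fine 3D facial reconstruction method that refines the 3D geometry of micro-expressions in monocular high-frame-rate video by combining global dynamic features with locally enriched multi-modal cues.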
Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.
Motivation and Objectives
- Motivate accurate reconstruction of subtle micro-expressions, which are often lost in macro-expression-focused methods.
- Develop a coarse-to-fine framework that fuses global dynamic features with locally enriched cues from 2D motion, 3D geometry, and facial priors.
- Leverage macro-expression data to mitigate micro-expression data scarcity through a dynamic-encoded module.
- Refine initialized meshes with a dynamic-guided mesh deformation module that preserves global structure while capturing fine-grained details.
Proposed Method
- Introduce a plug-and-play dynamic-encoded module that pairs a static encoder applied to onset frames with a motion encoder applied to optical flow, producing micro-expression-enhanced parameters via residual fusion and an N-ODE-based evolution.
- Apply a dynamic-guided mesh deformation module that fuses multi-modal local features (3D geometry, facial landmarks, and dense optical-flow based motion) and refines meshes through a graph convolutional network with motion-attention.
- Use region-based pixel-vertex correspondence to efficiently map optical-flow cues to 3D mesh regions, reducing computational load while preserving discriminability.
- Combine reconstruction fidelity losses (photometric, perceptual, landmarks, expression regularization, emotion, expression consistency, identity) with geometry regularization losses (Laplacian smoothness, normal consistency, flow-guided refinement) for training.
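The region-based pixel-vertex correspondence described above can be illustrated with a minimal sketch. The summary does not give the paper's actual implementation, so everything here is an assumption for illustration: the function names `pool_flow_by_region` and `vertex_motion_features`, the use of a plain mean as the pooling operator, and the toy region labels are all hypothetical. The idea shown is only that dense optical flow is pooled once per facial region, and each mesh vertex then reads the pooled motion of its region instead of being matched to individual pixels.

```python
from collections import defaultdict


def pool_flow_by_region(flow, pixel_region):
    """Average dense optical flow inside each facial region.

    flow: H x W nested list of (dx, dy) tuples.
    pixel_region: H x W nested list of region ids, -1 for background.
    Returns {region_id: (mean_dx, mean_dy)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum_dx, sum_dy, pixel_count]
    for flow_row, region_row in zip(flow, pixel_region):
        for (dx, dy), rid in zip(flow_row, region_row):
            if rid < 0:
                continue  # skip background pixels
            acc = sums[rid]
            acc[0] += dx
            acc[1] += dy
            acc[2] += 1
    return {rid: (sx / n, sy / n) for rid, (sx, sy, n) in sums.items()}


def vertex_motion_features(vertex_region, region_flow):
    """Give every mesh vertex the pooled flow of the region it belongs to.

    Vertices whose region received no pixels fall back to zero motion.
    """
    return [region_flow.get(rid, (0.0, 0.0)) for rid in vertex_region]


# Toy example (hypothetical labels): 2 x 3 image, region 0 and region 1,
# plus one background pixel marked -1.
flow = [[(1.0, 0.0), (3.0, 0.0), (0.0, 2.0)],
        [(9.0, 9.0), (0.0, 4.0), (0.0, 0.0)]]
pixel_region = [[0, 0, 1],
                [-1, 1, 1]]

region_flow = pool_flow_by_region(flow, pixel_region)
# Four vertices; the last one belongs to a region with no visible pixels.
features = vertex_motion_features([0, 1, 1, 2], region_flow)
```

With this layout the per-vertex cost is a dictionary lookup rather than a search over pixels, which is one plausible reading of how a region-level mapping reduces computational load.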
Experimental Results
Research Questions
- RQ1 Can global dynamic facial features learned from macro-expressions improve the reconstruction of subtle micro-expressions in 3D?
- RQ2 Do multi-modal local cues (3D geometry, landmarks, and 2D motion) provide complementary information that enables accurate micro-expression refinement on 3D meshes?
- RQ3 Is a coarse-to-fine framework effective for preserving global facial structure while capturing fine-grained micro-expressions from monocular video?
- RQ4 How does region-based motion mapping and motion-attentive refinement affect reconstruction fidelity and perceptual realism?
Key Findings
| Method | CASME II Acc (%) | CASME Acc (%) | SAMM Acc (%) | Avg. Acc (%) | L1 Loss | VGG Loss | FID |
|---|---|---|---|---|---|---|---|
| EMOCA | 40.00 | 38.93 | 31.37 | 36.77 | 0.085 | 1.578 | 112.37 |
| EMICA | 42.50 | 28.81 | 29.41 | 33.57 | 0.083 | 1.501 | 100.04 |
| SMIRK | 35.00 | 44.07 | 45.10 | 41.39 | 0.085 | 1.032 | 52.26 |
| SMIRK-FT | 46.25 | 42.37 | 50.98 | 46.53 | 0.050 | 0.745 | 33.80 |
| Ours | 53.75 | 44.70 | 56.86 | 51.77 | 0.041 | 0.700 | 30.41 |
- The proposed method (Ours) achieves the highest micro-expression recognition accuracy on CASME II, CASME, and SAMM (53.75%, 44.70%, and 56.86% respectively; average 51.77%) compared with EMOCA, EMICA, SMIRK, and SMIRK-FT.
- Our method yields the best average WF1 score (45.52%), outperforming baselines especially on CASME II and SAMM.
- Reconstruction quality also improves: the method attains the lowest L1 loss (0.041), the lowest VGG loss (0.700), and the lowest Fréchet Inception Distance (FID 30.41) among the compared methods.
- Ablation studies identify the dynamic-encoded module (DEM) as the most impactful component for accuracy: removing either DEM or the dynamic-guided mesh deformation module (DGMD) causes significant drops, and the multi-modal features and every loss term each contribute.
- Region-based motion mapping with motion-attentive refinement substantially contributes to discriminative micro-expression capture while maintaining global geometry.
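As a quick sanity check on the results table above, the short script below recomputes the average accuracy column from the three per-dataset accuracies and derives the margin over the strongest baseline, SMIRK-FT. All numbers are copied from the table; the variable names are only illustrative.

```python
# Per-dataset accuracies (%) in the order CASME II, CASME, SAMM,
# copied from the results table.
acc = {
    "EMOCA":    (40.00, 38.93, 31.37),
    "EMICA":    (42.50, 28.81, 29.41),
    "SMIRK":    (35.00, 44.07, 45.10),
    "SMIRK-FT": (46.25, 42.37, 50.98),
    "Ours":     (53.75, 44.70, 56.86),
}

# Unweighted mean over the three datasets, rounded like the table.
avg = {name: round(sum(values) / len(values), 2) for name, values in acc.items()}

# Absolute accuracy gain over SMIRK-FT, in percentage points.
gain_pts = round(avg["Ours"] - avg["SMIRK-FT"], 2)

# Relative reductions versus SMIRK-FT for two reconstruction metrics (%).
l1_drop = round((0.050 - 0.041) / 0.050 * 100, 1)
fid_drop = round((33.80 - 30.41) / 33.80 * 100, 1)

for name, value in avg.items():
    print(f"{name:9s} avg acc = {value:.2f}%")
print(f"gain over SMIRK-FT: {gain_pts:.2f} points")
print(f"L1 reduction: {l1_drop:.1f}%, FID reduction: {fid_drop:.1f}%")
```

The recomputed averages match the reported column, so it is an unweighted mean across datasets. Relative to SMIRK-FT, the method gains about 5.24 accuracy points on average, with roughly 18% lower L1 loss and 10% lower FID.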
This walkthrough was generated by AI and reviewed by a human editor.