QUICK REVIEW

[Paper Digest] Fine-Grained 3D Facial Reconstruction for Micro-Expressions

Che Sun, Xinjie Zhang | arXiv (Cornell University) | Mar 7, 2026

Face recognition and analysis | Cited by 0

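One-Sentence Summary

Proposes a coarse-to-fine 3D facial reconstruction method that refines the 3D geometry of micro-expressions in monocular high-frame-rate video by combining global dynamic features with locally enriched multi-modal cues.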

ABSTRACT

Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.

Motivation and Objectives

Motivate accurate reconstruction of subtle micro-expressions, which are often lost in macro-expression-focused methods.
Develop a coarse-to-fine framework that fuses global dynamic features with locally enriched cues from 2D motion, 3D geometry, and facial priors.
Leverage macro-expression data to mitigate micro-expression data scarcity through a dynamic-encoded module.
Refine initialized meshes with a dynamic-guided mesh deformation module that preserves global structure while capturing fine-grained details.

Proposed Method

Introduce a plug-and-play dynamic-encoded module that applies a static encoder to onset frames and a motion encoder to optical flow, producing micro-expression-enhanced parameters via residual fusion and an N-ODE-based evolution.
Apply a dynamic-guided mesh deformation module that fuses multi-modal local features (3D geometry, facial landmarks, and dense optical-flow based motion) and refines meshes through a graph convolutional network with motion-attention.
Use region-based pixel-vertex correspondence to efficiently map optical-flow cues to 3D mesh regions, reducing computational load while preserving discriminability.
Combine reconstruction fidelity losses (photometric, perceptual, landmarks, expression regularization, emotion, expression consistency, identity) with geometry regularization losses (Laplacian smoothness, normal consistency, flow-guided refinement) for training.
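
Two of the ingredients listed above, region-level pooling of optical flow and Laplacian smoothness, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the integer `region_map` that assigns each pixel to a face region, the function names, and the choice of a uniform (rather than cotangent) Laplacian are all assumptions made for illustration, since the digest does not give the exact formulations.

```python
import numpy as np


def region_mean_flow(flow, region_map, num_regions):
    """Average a dense flow field (H, W, 2) inside each labeled region.

    Assumption: region_map is an (H, W) integer array with labels in
    [0, num_regions) and negative values for background pixels.
    """
    labels = region_map.ravel()
    vectors = flow.reshape(-1, 2)
    valid = labels >= 0  # drop background pixels
    counts = np.bincount(labels[valid], minlength=num_regions)
    pooled = np.zeros((num_regions, 2))
    for axis in range(2):
        pooled[:, axis] = np.bincount(
            labels[valid], weights=vectors[valid, axis], minlength=num_regions
        )
    # Regions without pixels keep a zero vector instead of dividing by zero.
    return pooled / np.maximum(counts, 1)[:, None]


def uniform_laplacian_loss(vertices, faces):
    """Mean squared distance from each vertex to the centroid of its neighbors.

    Assumption: a uniform graph Laplacian built from triangle edges,
    stored densely, which is only practical for small meshes.
    """
    n = len(vertices)
    adjacency = np.zeros((n, n))
    for a, b in ((0, 1), (1, 2), (2, 0)):
        adjacency[faces[:, a], faces[:, b]] = 1.0
        adjacency[faces[:, b], faces[:, a]] = 1.0
    degree = adjacency.sum(axis=1, keepdims=True)
    centroid = adjacency @ vertices / np.maximum(degree, 1.0)
    return float(np.mean(np.sum((vertices - centroid) ** 2, axis=1)))
```

Pooling per region yields one motion vector per mesh region rather than one per pixel, which is the kind of saving the region-based correspondence is described as providing. A real mesh pipeline would likely use a sparse Laplacian and a differentiable framework instead of dense NumPy arrays.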

Experimental Results

Research Questions

RQ1 Can global dynamic facial features learned from macro-expressions improve the reconstruction of subtle micro-expressions in 3D?
RQ2 Do multi-modal local cues (3D geometry, landmarks, and 2D motion) provide complementary information that enables accurate micro-expression refinement on 3D meshes?
RQ3 Is a coarse-to-fine framework effective for preserving global facial structure while capturing fine-grained micro-expressions from monocular video?
RQ4 How does region-based motion mapping and motion-attentive refinement affect reconstruction fidelity and perceptual realism?

Key Findings

Method	CASME II Acc (%)	CASME Acc (%)	SAMM Acc (%)	Avg. Acc (%)	L1 Loss	VGG Loss	FID
EMOCA	40.00	38.93	31.37	36.77	0.085	1.578	112.37
EMICA	42.50	28.81	29.41	33.57	0.083	1.501	100.04
SMIRK	35.00	44.07	45.10	41.39	0.085	1.032	52.26
SMIRK-FT	46.25	42.37	50.98	46.53	0.050	0.745	33.80
Ours	53.75	44.70	56.86	51.77	0.041	0.700	30.41

The proposed method (Ours) achieves the highest micro-expression recognition accuracy on CASME II, CASME, and SAMM (53.75%, 44.70%, and 56.86% respectively; average 51.77%), ahead of EMOCA, EMICA, SMIRK, and SMIRK-FT.
The method also yields the best average weighted F1 (WF1) score (45.52%), outperforming the baselines especially on CASME II and SAMM.
Reconstruction quality improves as well: the method has the lowest L1 loss (0.041), the lowest VGG loss (0.700), and the best Fréchet Inception Distance (FID 30.41) among the compared methods.
Ablation studies identify the dynamic-encoded module (DEM) as the most impactful component for accuracy. Removing either DEM or the dynamic-guided mesh deformation module (DGMD) causes significant drops, and the ablations also show that the multi-modal features and every loss term contribute.
Region-based motion mapping with motion-attentive refinement substantially contributes to discriminative micro-expression capture while maintaining global geometry.
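
As a quick sanity check on the table, the sketch below recomputes the Avg. Acc column as an unweighted mean of the three per-dataset accuracies and measures the gap between Ours and SMIRK-FT. The numbers are copied from the table; treating the average as an unweighted mean is an assumption, although it matches the reported values to two decimals.

```python
# Per-dataset accuracy (%) in the order (CASME II, CASME, SAMM),
# followed by the reported Avg. Acc, all copied from the table.
reported = {
    "EMOCA": ((40.00, 38.93, 31.37), 36.77),
    "EMICA": ((42.50, 28.81, 29.41), 33.57),
    "SMIRK": ((35.00, 44.07, 45.10), 41.39),
    "SMIRK-FT": ((46.25, 42.37, 50.98), 46.53),
    "Ours": ((53.75, 44.70, 56.86), 51.77),
}


def mean_accuracy(scores):
    """Unweighted mean of the dataset accuracies, rounded like the table."""
    return round(sum(scores) / len(scores), 2)


recomputed = {name: mean_accuracy(scores) for name, (scores, _) in reported.items()}

# Methods whose recomputed mean disagrees with the reported average.
mismatches = [
    name for name, (_, avg) in reported.items()
    if abs(recomputed[name] - avg) > 0.005
]

# Accuracy gap, in percentage points, over the strongest baseline average.
gap = round(recomputed["Ours"] - recomputed["SMIRK-FT"], 2)

print(recomputed)
print("mismatches:", mismatches)
print("gap over SMIRK-FT:", gap)
```

An empty `mismatches` list means the reported averages are internally consistent with the per-dataset columns.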


This digest was generated by AI and reviewed by a human editor.