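[Paper Walkthrough] EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation
EmoTalk disentangles emotion from speech content to drive 3D face animation, reporting richer emotional expression and better lip synchronization than prior methods, and introduces a large-scale 3D emotional talking face dataset (3D-ETF).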
Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels. Then an emotion-guided feature fusion decoder is employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotional, and content embeddings so as to generate controllable personal and emotional styles. Finally, considering the scarcity of the 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: https://ziqiaopeng.github.io/emotalk
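Research Motivation and Goals
- Advance realistic speech-driven 3D face animation that carries rich emotion.
- Disentangle speech content from emotion so that emotional expression improves without conflicting with the spoken content.
- Provide an end-to-end trainable framework that supports controllable personal style and emotion intensity.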
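Proposed Method
- Introduce an emotion disentangling encoder (EDE) that uses two audio feature extractors to form separate content and emotion latent spaces.
- Enforce disentanglement with a cross-reconstruction loss over mixed emotion-content pairs.
- Develop an emotion-guided feature fusion decoder based on Transformer-style attention that maps the fused features to 52 blendshape coefficients.
- Add a velocity loss and a classification loss to promote temporal stability and better emotion recognition.
- Build the 3D-ETF dataset by deriving blendshape labels from 2D emotional datasets and applying blend skinning to obtain 3D meshes.
- Train and evaluate under 2D-to-3D supervision via blendshape coefficients, with FLAME model compatibility.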
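To make the role of the velocity loss concrete, here is a minimal sketch in plain Python. It assumes the velocity term is a mean squared error between frame-to-frame differences of the predicted and ground-truth blendshape sequences; the function names, the loss weights, and the toy 2-coefficient frames are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one frame of blendshape coefficients (52 per frame in the paper)


def mse(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Mean squared error over all frames and coefficients."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of coefficients")
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("nothing to compare")
    return total / count


def frame_deltas(seq: Sequence[Frame]) -> List[List[float]]:
    """Frame-to-frame differences: a discrete velocity per coefficient."""
    return [
        [cur - prev for prev, cur in zip(seq[t - 1], seq[t])]
        for t in range(1, len(seq))
    ]


def velocity_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Assumed form: MSE between predicted and target velocities (needs >= 2 frames)."""
    return mse(frame_deltas(pred), frame_deltas(target))


def total_loss(pred, target, w_pos: float = 1.0, w_vel: float = 1.0) -> float:
    """Hypothetical weighted sum of a position term and a velocity term."""
    return w_pos * mse(pred, target) + w_vel * velocity_loss(pred, target)


# Toy sequences with 2 coefficients per frame.
target = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
offset = [[0.25, 0.25], [0.75, 0.75], [1.25, 1.25]]  # constant bias, same motion
jitter = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]        # correct endpoints, jerky motion

print(mse(offset, target), velocity_loss(offset, target))
print(mse(jitter, target), velocity_loss(jitter, target))
```

In this toy case the constant offset is penalized only by the position term, while the jerky sequence is penalized mainly by the velocity term, which is why such a term is commonly paired with a plain reconstruction loss to discourage frame-to-frame jitter.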
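Experimental Results
Research Questions
- RQ1: Can emotion in speech be effectively disentangled from content to drive rich 3D face animation?
- RQ2: Does emotion-guided fusion improve the expressiveness of 3D facial motion beyond lip synchronization?
- RQ3: Can pseudo-3D data derived from 2D emotional datasets support large-scale training of 3D emotional talking faces?
Key Findings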
| Dataset | Method | LVE (mm) | EVE (mm) |
|---|---|---|---|
| RAVDESS | VOCA | 5.091 | 4.188 |
| RAVDESS | MeshTalk | 3.459 | 3.386 |
| RAVDESS | FaceFormer | 3.247 | 3.757 |
| RAVDESS | Ours | 2.762 | 2.493 |
| HDTF | VOCA | 4.447 | 3.286 |
| HDTF | MeshTalk | 3.886 | 3.124 |
| HDTF | FaceFormer | 3.374 | 3.142 |
| HDTF | Ours | 2.892 | 2.364 |
- EmoTalk achieves lower lip vertex error (LVE) and emotional vertex error (EVE) than existing methods on both the RAVDESS and HDTF datasets.
- On RAVDESS, LVE and EVE are 2.762 mm and 2.493 mm for EmoTalk, outperforming VOCA (5.091, 4.188), MeshTalk (3.459, 3.386), and FaceFormer (3.247, 3.757).
- On HDTF, EmoTalk achieves LVE 2.892 mm and EVE 2.364 mm, better than VOCA (4.447, 3.286), MeshTalk (3.886, 3.124), and FaceFormer (3.374, 3.142).
- Zero-shot evaluation on VOCA-Test shows strong generalization, with EmoTalk outperforming baselines in lip accuracy.
- User studies indicate EmoTalk is preferred over MeshTalk and FaceFormer across full-face realism, lip synchronization, and emotion expression.
- Ablation confirms the importance of the Emotion Disentangling Encoder and emotion-guided multi-head attention for emotion expression.
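For readers unfamiliar with the LVE and EVE metrics reported above, the following sketch shows one way such a region-based vertex error can be computed. It assumes the definition commonly used in this line of work: for each frame, take the maximum L2 distance between predicted and ground-truth vertices within a face region (lips for LVE; an upper-face region such as eyes and forehead for EVE), then average over frames. The function name, the vertex index sets, and the toy meshes are hypothetical; real index sets depend on the mesh topology.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]
Mesh = Sequence[Vertex]  # one frame: a list of 3D vertex positions (in mm)


def region_vertex_error(
    pred_frames: Sequence[Mesh],
    gt_frames: Sequence[Mesh],
    region: Sequence[int],
) -> float:
    """Per-frame maximum L2 error over the region's vertices, averaged over frames."""
    if len(pred_frames) != len(gt_frames) or not pred_frames:
        raise ValueError("need the same, non-zero number of frames")
    if not region:
        raise ValueError("region must contain at least one vertex index")
    per_frame_max = [
        max(math.dist(pred[i], gt[i]) for i in region)
        for pred, gt in zip(pred_frames, gt_frames)
    ]
    return sum(per_frame_max) / len(per_frame_max)


# Toy data: 2 frames, 3 vertices each. Index sets are illustrative only.
LIP_IDX = [0, 1]
gt = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 5.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 5.0, 0.0)],
]
pred = [
    [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0), (0.0, 50.0, 0.0)],  # vertex 0 is off by 5 mm
    [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 50.0, 0.0)],  # vertex 1 is off by 1 mm
]

lve = region_vertex_error(pred, gt, LIP_IDX)
print(lve)  # 3.0: the mean of the per-frame maxima 5.0 and 1.0
```

Vertex 2 carries a large error in both frames but lies outside the chosen region, so it does not affect the result; the same function with an upper-face index set would give an EVE-style number.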
This walkthrough was generated by AI and reviewed by a human editor.