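[Paper Walkthrough] Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose
This paper proposes a neural pipeline that maps source audio to a talking-face video of a target speaker with personalized head pose, using 3D face reconstruction and a memory-augmented GAN to improve realism. A general audio-to-face mapping is fine-tuned on a short target video to capture personalized motion and head pose.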
Real-world talking faces often accompany with natural head movement. However, most existing talking face video generation methods only consider facial animation with fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which makes synthesized talking face video far from realistic. To address this challenge, we reconstruct 3D face animation and re-render it into synthesized frames. To fine tune these frames into realistic ones with smooth background transition, we propose a novel memory-augmented GAN module. By first training a general mapping based on a publicly available dataset and fine-tuning the mapping using the input short video of target person, we develop an effective strategy that only requires a small number of frames (about 300 frames) to learn personalized talking behavior including head pose. Extensive experiments and two user studies show that our method can generate high-quality (i.e., personalized head movements, expressions and good lip synchronization) talking face videos, which are naturally looking with more distinguishing head movement effects than the state-of-the-art methods.
Research Motivation and Goals
- Motivate natural talking-face video generation with personalized head pose instead of fixed pose.
- Learn a general audio-to-face mapping and fine-tune it on a short target video to capture individualized head motion and expressions.
- Bridge audio–visual cues through 3D facial animation and rendering to produce realistic frames.
- Refine rendered frames with a memory-augmented GAN that adapts to arbitrary target identities.
Proposed Method
- Stage 1: Learn a general audio-to-expression-and-head-pose mapping from audio MFCC features to 3DMM expression and pose using an LSTM network.
- Stage 2: Reconstruct the target’s 3D face from a short video, fine-tune the mapping to capture personalized talking behavior, and obtain stage-2 3D facial animation.
- Render 3D facial animation into frames using target identity texture/illumination, then refine with a memory-augmented GAN that uses identity features and a memory module to adapt across identities.
- Use a memory network to store and retrieve identity features for refinement, enabling one-shot/few-shot personalization.
- Train the GAN with a two-stream conditioning: a window of rendered frames and an identity feature, employing an attention-based generator and PatchGAN-based discriminator.
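The store-and-retrieve step of the memory module can be illustrated with a small key-value lookup. This is a minimal sketch and not the authors' implementation: the class name `IdentityMemory`, the use of cosine similarity for matching, and the toy feature vectors are all assumptions made for illustration. The summary above only states that identity features are stored and retrieved.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class IdentityMemory:
    """Hypothetical key-value memory: keys are query features, values are identity features."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        """Store one (query feature, identity feature) pair."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        """Return the identity feature whose key is most similar to the query, plus the score."""
        if not self.keys:
            raise ValueError("memory is empty")
        scores = [cosine_similarity(query, k) for k in self.keys]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.values[best], scores[best]


# Toy usage: two stored identities, then a query that lies close to the first key.
memory = IdentityMemory()
memory.write([1.0, 0.0], [0.9, 0.1, 0.0])  # assumed features for identity A
memory.write([0.0, 1.0], [0.0, 0.2, 0.8])  # assumed features for identity B
identity, score = memory.read([0.8, 0.2])
```

In this toy run the query retrieves the identity feature stored under the first key. In a GAN setting, the retrieved feature would be passed to the generator as the identity stream alongside the window of rendered frames.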
Experimental Results
Research Questions
- RQ1: Can audio alone drive natural lip synchronization while incorporating personalized head pose based on a short target video?
- RQ2: How can 3D geometry and rendering be integrated with learning-based refinement to produce realistic talking-face videos for arbitrary identities?
- RQ3: Does a memory-augmented GAN enable high-quality, identity-aware frame refinement across different subjects?
- RQ4: What is the effect of fine-tuning on a small target video (~300 frames) for personalized head pose adaptation?
Key Findings
| Method | PSNR | SSIM | LMD |
|---|---|---|---|
| Chen | 29.65 | 0.73 | 1.73 |
| Wiles | 29.82 | 0.75 | 1.60 |
| You said that | 29.91 | 0.77 | 1.63 |
| DAVS | 29.90 | 0.73 | 1.73 |
| ATVG | 30.91 | 0.81 | 1.37 |
| Ours-G | 30.94 | 0.75 | 1.58 |
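For readers unfamiliar with the columns, PSNR and SSIM are image-quality scores where higher is better, while LMD (landmark distance) measures lip-landmark error, so lower is better. The sketch below shows the standard PSNR formula and a plain mean landmark distance. It is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, and the exact landmark set and any normalization used for the reported numbers are not given here.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equal-size images given as flat lists of pixel values."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def landmark_distance(reference_frames, generated_frames):
    """Mean Euclidean distance between corresponding (x, y) landmarks over all frames."""
    total, count = 0.0, 0
    for ref_pts, gen_pts in zip(reference_frames, generated_frames):
        for (rx, ry), (gx, gy) in zip(ref_pts, gen_pts):
            total += math.hypot(rx - gx, ry - gy)
            count += 1
    if count == 0:
        raise ValueError("no landmarks to compare")
    return total / count


# Toy data: a 4-pixel image off by 10 everywhere, and one frame with two landmarks.
toy_psnr = psnr([0, 0, 0, 0], [10, 10, 10, 10])
toy_lmd = landmark_distance([[(0, 0), (1, 0)]], [[(3, 4), (1, 0)]])  # distances 5 and 0 -> 2.5
```

Because PSNR is logarithmic in the mean squared error, small differences such as 30.91 versus 30.94 correspond to very small changes in pixel error.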
- The proposed Ours-P model achieves better image quality, lip synchronization, and naturalness than state-of-the-art methods in subjective user studies.
- Quantitative results on the LRW dataset show Ours-G achieving the highest PSNR (30.94) and competitive SSIM (0.75) and LMD (1.58) compared to Chen, Wiles, You said that, DAVS, and ATVG.
- In the single-input-frame setting (Ours-G), lip synchronization is comparable to prior methods rather than uniformly better: ATVG scores higher on SSIM (0.81) and has a lower LMD (1.37).
- Fine-tuning with roughly 300 frames (Ours-P) enables personalized head pose and expressions that outperform fixed-pose baselines in qualitative and user-study assessments.
- A memory-augmented GAN with identity-conditioned refinement yields more realistic textures and facial details across diverse identities than non-memory baselines.
This walkthrough was generated by AI and reviewed by a human editor.