QUICK REVIEW

[Paper Review] Talking Face Generation by Conditional Recurrent Adversarial Network

Yang Song, Jingwen Zhu|arXiv (Cornell University)|Apr 13, 2018

Speech and Audio Processing|References: 32|Cited by: 21

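One-Sentence Summary

This paper proposes a conditional recurrent adversarial network that generates high-fidelity talking face videos by jointly modeling audio and image features inside the recurrent unit, achieving accurate lip synchronization and smooth facial motion; by adding spatio-temporal discriminators and a lip-reading discriminator, the method reports state-of-the-art video realism, lip-sync accuracy, and visual quality without post-processing, outperforming prior methods on the VoxCeleb and LRW datasets.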

ABSTRACT

Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate the talking face video with accurate lip synchronization while maintaining smooth transition of both lip and facial movement over the entire video clip. Existing works either do not consider temporal dependency on face images across different video frames thus easily yielding noticeable/abrupt facial and lip movement or are only limited to the generation of talking face video for a specific person thus lacking generalization capacity. We propose a novel conditional video generation network where the audio input is treated as a condition for the recurrent adversarial network such that temporal dependency is incorporated to realize smooth transition for the lip and facial movement. In addition, we deploy a multi-task adversarial training scheme in the context of video generation to improve both photo-realism and the accuracy for lip synchronization. Finally, based on the phoneme distribution information extracted from the audio clip, we develop a sample selection method that effectively reduces the size of the training dataset without sacrificing the quality of the generated video. Extensive experiments on both controlled and uncontrolled datasets demonstrate the superiority of the proposed approach in terms of visual quality, lip sync accuracy, and smooth transition of lip and facial movement, as compared to the state-of-the-art.

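Motivation and Objectives

Address the challenge of generating realistic talking face videos with accurate lip synchronization and smooth temporal transitions.
Overcome the limitations of existing methods, which either ignore temporal dependency or generalize poorly across different faces and audio inputs.
Improve image and video realism through adversarial training with dedicated discriminators.
Improve lip-motion accuracy with a lip-reading discriminator pre-trained on real speech videos.
Extend the framework to model natural facial expressions and head poses in single-person video generation.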

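Proposed Method

A conditional recurrent adversarial network fuses image and audio features inside the recurrent unit to model the temporal dependency of facial and lip movement.
A pair of spatio-temporal discriminators enforces both the realism of individual frames and the video-level realism of the whole sequence.
A lip-reading discriminator, trained adversarially, pushes the generator to produce lip movements that are semantically aligned with the speech input.
The network is extended to model natural poses and expressions by feeding previously generated frames into the recurrent unit together with the hybrid features.
Training is end-to-end and combines adversarial, reconstruction, and perceptual losses to improve visual fidelity.
The method operates directly on MFCC features and needs no post-processing steps such as deblurring or stabilization.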
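
The core idea in the first item above, conditioning a recurrent unit on both the face image and the per-frame audio, can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a plain tanh RNN cell on toy feature vectors, whereas the paper works with learned image and MFCC encoders and an image decoder. All names here (`recurrent_step`, `generate_frame_codes`, the feature sizes) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix as a list of rows (toy initialisation, not trained)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def recurrent_step(hidden, image_feat, audio_feat, w_in, w_hid):
    """One step of a plain tanh RNN cell over the fused [image; audio] input.

    Assumption: a simple RNN cell stands in for whatever gated unit a real
    model would use.
    """
    fused = image_feat + audio_feat  # list concatenation: joint conditioning
    pre = [a + b for a, b in zip(matvec(w_in, fused), matvec(w_hid, hidden))]
    return [math.tanh(v) for v in pre]


def generate_frame_codes(image_feat, audio_feats, hidden_size=8, seed=0):
    """Return one hidden code per audio frame; a decoder (not shown) would
    turn each code into a video frame."""
    rng = random.Random(seed)
    w_in = make_matrix(hidden_size, len(image_feat) + len(audio_feats[0]), rng)
    w_hid = make_matrix(hidden_size, hidden_size, rng)
    hidden = [0.0] * hidden_size
    codes = []
    for audio_feat in audio_feats:
        # The hidden state carries information from earlier frames forward,
        # which is what gives the sequence its temporal dependency.
        hidden = recurrent_step(hidden, image_feat, audio_feat, w_in, w_hid)
        codes.append(hidden)
    return codes


# Toy inputs: one identity feature vector and five audio feature frames.
image_feat = [0.5, -0.2, 0.1, 0.9]
audio_feats = [[0.1 * t, -0.05 * t, 0.02 * t] for t in range(5)]
codes = generate_frame_codes(image_feat, audio_feats)
print(len(codes), len(codes[0]))  # 5 8
```

Because the hidden state is carried across steps, feeding the same audio frame twice in a row yields two different codes. That is the property a frame-by-frame generator without recurrence lacks.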

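Experimental Results

Research Questions

RQ1: Can a recurrent adversarial network effectively model the temporal dependency of facial and lip movement in talking face generation?
RQ2: Does the lip-reading discriminator significantly improve lip-sync accuracy beyond what pixel-level reconstruction achieves?
RQ3: Can the spatio-temporal discriminators improve image and video realism without additional post-processing?
RQ4: Can the framework maintain high visual quality and smooth motion on unseen faces and audio inputs?
RQ5: Can the model naturally capture head-pose and expression changes in single-person video generation?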

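Key Findings

The proposed method reaches 63.0% top-5 lip-reading accuracy on generated videos, approaching the 80% accuracy measured on real videos, which indicates strong lip-sync fidelity.
In a user study, 74% of participants preferred this method's lip-motion accuracy over Chung et al. (2017), and 87% preferred its video realism over Zhou et al. (2019).
On image quality, the method outperforms state-of-the-art baselines: 73% of users judged it better than Chung et al. (2017) at suppressing artifacts and blur.
On the Obama dataset, the extended model generated videos with natural head-pose and expression changes, avoiding the face-shift artifacts common in sequential generation.
The framework removes the need for video stabilization or deblurring pipelines; the generator outputs high-quality results directly.
Ablation experiments show that the lip-reading discriminator markedly improves lip-sync accuracy, with a 25% relative gain in top-5 accuracy over the baseline.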
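
The findings above rely on two quantities: top-5 accuracy and relative improvement. The sketch below shows how such figures are typically computed. It is an illustration, not the paper's evaluation code. The function names are hypothetical, the scores and labels are made-up toy values rather than results from the paper, and `k=2` is used only because the toy example has four classes.

```python
def top_k_accuracy(score_rows, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def relative_improvement(new, baseline):
    """Gain expressed as a fraction of the baseline value."""
    return (new - baseline) / baseline


# Toy classifier scores for four clips over four word classes (not paper data).
scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.50, 0.10, 0.30, 0.10],
    [0.25, 0.15, 0.50, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
labels = [1, 2, 3, 0]

print(top_k_accuracy(scores, labels, k=2))  # 0.75

# Hypothetical accuracies: 0.60 -> 0.75 is 15 points absolute, 25% relative.
print(f"{relative_improvement(0.75, 0.60):.0%}")  # 25%
```

The distinction matters when reading the ablation result: a 25% relative gain is measured against the baseline's own accuracy, so it is not the same as a 25-percentage-point increase.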

This review was generated by AI and reviewed by human editors.