QUICK REVIEW

[Paper Review] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Hang Zhou, Yu Liu|arXiv (Cornell University)|Jul 20, 2018

Generative Adversarial Networks and Image Synthesis | References: 30 | Cited by: 37

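One-Sentence Summary

This paper proposes a disentangled audio-visual representation framework for generating high-fidelity, identity-preserving talking faces from either audio or video input. By jointly learning discriminative speech and identity representations through associative and adversarial training, the method outperforms prior work in lip-sync accuracy and realism, and also yields better results on downstream tasks such as lip reading and audio-video retrieval.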

ABSTRACT

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance model on specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

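Research Motivation and Objectives

Enable talking face generation for arbitrary subjects while preserving identity and staying accurately synchronized with the speech input.
Use deep representation learning to disentangle subject-specific identity from speech-related content in talking face sequences.
Unify audio-visual speech recognition and audio-visual synchronization in a single end-to-end generation framework.
Improve downstream tasks such as automatic lip reading and audio-video retrieval through the disentangled representation.
Address the coupling of identity and speech information that makes data-driven talking face generation difficult.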

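Proposed Method

Learn a joint audio-visual embedding space by using word-ID labels as supervision, aligning lip reading from video with speech recognition from audio.
Apply adversarial training to disentangle the speech-content (word-ID) representation from the identity (person-ID) representation, so that the latter carries as little speech information as possible.
Use a dual-encoder architecture: identity features come from a single reference image, and speech-content features come from an audio or video clip.
A generator network combines the disentangled identity and speech features to produce the face sequence, with a GAN loss to improve realism.
Use a shared classifier and domain-adaptation-style training to improve feature disentanglement and generalization across subjects.
Apply a contrastive loss to make the features more discriminative, which benefits retrieval and lip reading.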
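
Two of the loss terms named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The contrastive loss is the standard margin-based form. The "confusion" loss, which pushes a word classifier's output on identity features toward a uniform distribution, is one common way to express an adversarial disentanglement objective and is assumed here. The function names, the margin value, and the toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_emb, video_emb, is_match, margin=1.0):
    """Standard margin-based contrastive loss for one audio/video pair.

    Matching pairs are pulled together (loss = d^2); non-matching pairs
    are pushed apart until they are at least `margin` away.
    The margin of 1.0 is an illustrative default, not a value from the paper.
    """
    d = euclidean(audio_emb, video_emb)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confusion_loss(word_logits):
    """Assumed adversarial objective for the identity encoder.

    `word_logits` are the outputs of a word-ID classifier applied to
    identity features. The loss is zero when the classifier's prediction
    is uniform, i.e. when the identity features reveal nothing about the
    spoken word.
    """
    probs = softmax(word_logits)
    uniform = 1.0 / len(probs)
    return sum((p - uniform) ** 2 for p in probs)


# Toy usage with made-up 2-D embeddings and 4 word classes.
pair_loss = contrastive_loss([0.0, 0.0], [0.5, 0.0], is_match=False)
leak_loss = confusion_loss([2.0, 0.0, 0.0, 0.0])
```

In a real training loop the identity encoder would be updated to lower `confusion_loss` while the word classifier is updated to raise its own accuracy, which is what makes the setup adversarial.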

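Experimental Results

Research Questions

RQ1: Can a disentangled audio-visual representation be learned that enables high-quality talking face generation for arbitrary subjects?
RQ2: Can audio and video speech inputs be used interchangeably to drive identity-preserving face generation?
RQ3: Does adversarial disentanglement of identity and speech content improve lip-sync accuracy and visual quality?
RQ4: To what extent does the learned representation improve automatic lip reading and audio-video retrieval?
RQ5: Can joint audio-visual representation learning improve the robustness and disentanglement of facial motion generation?

Key Findings

The method achieves state-of-the-art lip-reading performance on the LRW dataset; recognition accuracy improves thanks to the discriminative joint audio-visual representation.
Adversarial disentanglement reduces the accuracy of classifying speech content from the identity encoder's features (a measure of speech leakage) from 27.8% to 9.7% on test samples, confirming that the disentanglement is effective.
Qualitative results show that combining the shared classifier with adversarial training noticeably improves the duration and clarity of lip motion compared with the baselines.
Audio-video matching retrieval reaches R@1 = 84.2%, R@10 = 96.7%, and Median Rank = 2.1, indicating strong feature alignment.
Lip-sync quality improves: after disentanglement, the mean L2-norm deviation between lip landmarks of the generated results and the ground truth is lower.
The framework supports end-to-end generation from either audio or video input, showing robustness and flexibility with respect to input modality.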
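
For readers unfamiliar with the metrics quoted above, the following sketch shows how Recall@K, median rank, and a mean landmark distance are conventionally computed. It assumes a square distance matrix in which query i's true match is gallery item i, and 2-D landmark coordinates. The function names and the toy numbers are hypothetical and are not taken from the paper's evaluation code.

```python
import math
from statistics import median


def match_ranks(dist):
    """Return the 1-based rank of the true match for each query.

    dist[i][j] is the distance between query i (e.g. an audio clip) and
    gallery item j (e.g. a video clip). The true match of query i is
    assumed to be gallery item i.
    """
    ranks = []
    for i, row in enumerate(dist):
        # Rank = 1 + number of gallery items strictly closer than the true match.
        ranks.append(1 + sum(1 for d in row if d < row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_landmark_distance(pred, target):
    """Mean L2 distance between corresponding 2-D landmarks."""
    dists = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


# Toy example: 3 queries against 3 gallery items (made-up distances).
toy_dist = [
    [0.1, 0.5, 0.9],
    [0.4, 0.3, 0.2],
    [0.7, 0.8, 0.6],
]
toy_ranks = match_ranks(toy_dist)        # [1, 2, 1]
toy_r1 = recall_at_k(toy_ranks, 1)       # 2 of 3 queries hit at rank 1
toy_median = median(toy_ranks)           # 1
```

Lower median rank and lower landmark distance are better, while higher Recall@K is better.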


This review was generated by AI and checked by human editors.