QUICK REVIEW

[Paper Review] DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

Yifeng Ma, Shiwei Zhang|arXiv (Cornell University)|Dec 15, 2023
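Face recognition and analysis|Cited by 9
One-Sentence Summary

DreamTalk combines a diffusion model with a denoising network, a style-aware lip expert, and a style predictor to generate expressive, audio-driven talking heads without requiring extensive style references, improving lip-sync and enabling diverse speaking styles.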

ABSTRACT

Emotional talking head generation has attracted growing attention. Previous methods, which are mainly GAN-based, still struggle to consistently produce satisfactory results across diverse emotions and cannot conveniently specify personalized emotions. In this work, we leverage powerful diffusion models to address the issue and propose DreamTalk, a framework that employs meticulous design to unlock the potential of diffusion models in generating emotional talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network can consistently synthesize high-quality audio-driven face motions across diverse emotions. To enhance lip-motion accuracy and emotional fullness, we introduce a style-aware lip expert that can guide lip-sync while preserving emotion intensity. To more conveniently specify personalized emotions, a diffusion-based style predictor is utilized to predict the personalized emotion directly from the audio, eliminating the need for extra emotion reference. By this means, DreamTalk can consistently generate vivid talking faces across diverse emotions and conveniently specify personalized emotions. Extensive experiments validate DreamTalk's effectiveness and superiority. The code is available at https://github.com/ali-vilab/dreamtalk.

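Motivation and Goals

  • Advance expressive talking head generation beyond the limits of neutral expressions.
  • Leverage diffusion models to achieve high-quality, diverse speaking styles.
  • Remove the dependence on costly style reference videos or text by inferring style from the audio and the portrait.
  • Ensure accurate lip-sync while preserving vivid expressions across languages and input conditions.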

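Proposed Method

  • Use a diffusion-based denoising network, conditioned on the audio and a style reference video, to synthesize audio-driven face motions.
  • Introduce a style-aware lip expert that guides lip-sync while preserving the expressive speaking style.
  • Incorporate a diffusion-based style predictor that infers the speaking style directly from the audio (and the portrait), reducing reliance on style references.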
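
To make the first bullet more concrete, here is a minimal sketch of conditional diffusion sampling in plain Python. It is a generic DDPM-style loop, not DreamTalk's actual implementation: `toy_denoiser`, `audio_feat`, `style_code`, the linear noise schedule, and the choice to have the denoiser predict the clean motion (rather than the noise) are all assumptions made for illustration. A trained network would replace `toy_denoiser` and would also make use of the noisy input and the step index.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]


def toy_denoiser(x_t, t, audio_feat, style_code):
    """Hypothetical stand-in for a trained network. It predicts the clean
    motion from the conditions alone and ignores x_t and t."""
    return [a * style_code for a in audio_feat]


def sample(denoiser, audio_feat, style_code, betas, alpha_bars, rng):
    """Reverse process using the DDPM posterior with an x0-predicting denoiser."""
    x = [rng.gauss(0.0, 1.0) for _ in audio_feat]  # start from pure noise
    for t in reversed(range(len(betas))):
        x0_pred = denoiser(x, t, audio_feat, style_code)
        if t == 0:
            return x0_pred  # last step: return the clean estimate, no noise
        abar_t, abar_prev, beta_t = alpha_bars[t], alpha_bars[t - 1], betas[t]
        coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
        coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
        sigma = math.sqrt(beta_t * (1.0 - abar_prev) / (1.0 - abar_t))
        x = [coef_x0 * p + coef_xt * v + sigma * rng.gauss(0.0, 1.0)
             for p, v in zip(x0_pred, x)]
    return x


rng = random.Random(0)
betas, alpha_bars = make_schedule(50)
audio_feat = [0.2, -0.5, 1.0]  # made-up per-frame audio features
motion = sample(toy_denoiser, audio_feat, 2.0, betas, alpha_bars, rng)
```

Because the toy denoiser depends only on its conditions, `motion` ends up as the audio features scaled by the style code. The point of the sketch is the structure: the audio and style conditions are passed to the denoiser at every reverse step, which is how conditioning typically enters a diffusion sampler.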

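Experimental Results

Research Questions

  • RQ1: Can diffusion models generate expressive talking heads with accurate lip-sync across diverse speaking styles?
  • RQ2: How can lip-motion guidance be made style-aware so as to balance expressiveness and lip-sync accuracy?
  • RQ3: Can a personalized speaking style be predicted directly from audio, without a reference video or text?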

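Key Findings

  • DreamTalk outperforms several recent methods in lip-sync accuracy and style expressiveness across multiple datasets.
  • The style-aware lip expert preserves vivid expressions while maintaining strong lip-sync.
  • The style predictor infers a personalized speaking style from the audio and the portrait, reducing the need for extra style references.
  • DreamTalk generalizes robustly to out-of-domain portraits, multilingual speech, and noisy audio.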


This review was generated by AI and reviewed by human editors.