QUICK REVIEW

[Paper Digest] What comprises a good talking-head video generation?: A Survey and Benchmark

Lele Chen, Guofeng Cui|arXiv (Cornell University)|May 7, 2020

Face recognition and analysis | References: 47 | Cited by: 30

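ONE-SENTENCE SUMMARY

A survey and benchmark for identity-independent talking-head video generation that introduces new perceptual metrics and a unified evaluation protocol for assessing identity preservation, lip synchronization, visual quality, and natural motion.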

ABSTRACT

Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. As for evaluation, we either propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video, namely, identity preserving, lip synchronization, high video quality, and natural-spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: https://github.com/lelechen63/talking-head-generation-survey.

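RESEARCH MOTIVATION AND GOALS

Define what makes a good talking-head video by enumerating the desired properties: identity preservation, lip synchronization, visual quality, and natural, spontaneous motion.
Critically assess existing evaluation metrics and identify their strengths and limitations for talking-head synthesis.
Provide standardized pre-processing and benchmarking procedures so that methods can be evaluated reproducibly.
Propose and validate new perceptual metrics that reflect video-level quality and agree with human perception.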

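PROPOSED METHOD

Introduce four desired properties as evaluation targets: identity preservation, lip synchronization, visual quality, and natural, spontaneous motion.
Survey and analyze existing metrics for identity preservation, visual quality, lip synchronization, and motion; propose LRSD, ESD, and BSD as new video-level metrics.
Develop a unified pre-processing pipeline (face tracking, cropping, and alignment) that enables benchmarking across datasets.
Evaluate state-of-the-art identity-independent talking-head methods under several protocols to expose their strengths and weaknesses.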
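
The video-level metrics named above can be pictured with a small sketch. It is illustrative only and assumes that such a metric embeds a real clip and a generated clip with a pretrained network (for example a lip-reading, emotion, or blink model) and then compares the two embeddings. The `embed` callable, the `toy_embed` stand-in, and the choice of cosine distance are all assumptions made for this example; they are not taken from the paper or its repository.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def similarity_distance(
    embed: Callable[[Sequence[float]], Vector],
    real_clip: Sequence[float],
    generated_clip: Sequence[float],
) -> float:
    """Embed both clips with the same model and compare the embeddings.

    `embed` is a hypothetical stand-in for a pretrained feature extractor.
    """
    return cosine_distance(embed(real_clip), embed(generated_clip))


def toy_embed(clip: Sequence[float]) -> Vector:
    """Hypothetical embedding: mean and range of a 1-D per-frame signal.

    A real extractor would be a neural network applied to video frames.
    """
    return [sum(clip) / len(clip), max(clip) - min(clip)]


# Made-up per-frame "mouth openness" signals, for illustration only.
real = [0.2, 0.8, 0.2]
static_mouth = [0.5, 0.5, 0.5]
print(similarity_distance(toy_embed, real, real))
print(similarity_distance(toy_embed, real, static_mouth))
```

Under these assumptions, a generated clip whose mouth never moves ends up farther from the real clip than an exact copy does, which is the kind of ordering a similarity distance is meant to capture.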

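EXPERIMENTAL RESULTS

Research Questions

RQ1: What are the strengths and limitations of the evaluation metrics currently used for talking-head generation?
RQ2: Which metrics should be preferred for each of the four desired properties, and do the new metrics improve evaluation?
RQ3: Are the proposed metrics robust across different test protocols and datasets?
RQ4: Where do current methods fall short on lip synchronization and spontaneous motion, and what should future work address?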

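Key Findings

Introduces three new metrics for video-level evaluation: Lip-Reading Similarity Distance (LRSD), Emotion Similarity Distance (ESD), and Blink Similarity Distance (BSD).
Finds evidence that many methods struggle with head-pose changes between the reference and target frames, and with producing accurate lip movements for certain words.
Shows that LRSD agrees with human judgments and rankings of videos.
Observes that current models typically produce limited spontaneous motion and struggle to achieve natural lip synchronization under realistic head movement.
Provides an open-source benchmark repository to standardize evaluation and ease cross-method comparison.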
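
One common way to check whether an automatic metric agrees with human rankings, as claimed for LRSD above, is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python. It is a generic illustration: this digest does not say which agreement statistic the paper uses, and the scores in the example are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: a distance per method (lower is better) and the rank
# human raters gave the same methods (1 is best).
metric_distance = [0.12, 0.30, 0.45, 0.51]
human_rank = [1, 2, 3, 4]
print(spearman(metric_distance, human_rank))
```

A value near 1 would mean the metric orders the methods the same way the human raters do; a value near 0 would mean the two orderings are unrelated.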

This digest was generated by AI and reviewed by human editors.