QUICK REVIEW

[Paper Review] What comprises a good talking-head video generation?: A Survey and Benchmark

Lele Chen, Guofeng Cui|arXiv (Cornell University)|May 7, 2020

Face recognition and analysis|47 references|30 citations

TL;DR

A survey and benchmark of identity-independent talking-head video generation, introducing new perceptual metrics and a uniform evaluation protocol to assess identity preservation, lip synchronization, visual quality, and natural motion.

ABSTRACT

Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. As for evaluation, we either propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video, namely, identity preserving, lip synchronization, high video quality, and natural-spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: https://github.com/lelechen63/talking-head-generation-survey.

Motivation & Objective

Define what constitutes a good talking-head video by enumerating desired properties (identity preservation, lip synchronization, visual quality, natural motion).
Critically review existing evaluation metrics and identify their strengths and limitations for talking-head synthesis.
Provide a standardized preprocessing and benchmarking pipeline to enable reproducible evaluation across methods.
Propose and validate new perceptual metrics that capture video-level quality and human-perceived similarity.

Proposed method

Introduce four desiderata for evaluation: identity preserving, lip synchronization, visual quality, and natural-spontaneous motion.
Survey and analyze existing identity-preserving, visual quality, lip-sync, and motion metrics; propose LRSD, ESD, and BSD as new video-level measures.
Develop a uniform preprocessing pipeline including face tracking, cropping, and alignment to enable cross-dataset benchmarking.
Evaluate state-of-the-art identity-independent talking-head methods under various protocols to reveal strengths and weaknesses.
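
The review does not spell out how LRSD, ESD, and BSD are computed, so the following is only a minimal sketch of the general idea of a video-level similarity distance. It assumes each video has already been mapped to a fixed-length feature vector by some pretrained network (for example a lip-reading, emotion, or blink model), and it assumes Euclidean distance as the comparison. The function names `similarity_distance` and `mean_distance` are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Iterable, Sequence, Tuple


def similarity_distance(real_feat: Sequence[float],
                        fake_feat: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors.

    Assumption: the features come from a pretrained network applied to the
    ground-truth video and the generated video. Lower means more similar.
    """
    if len(real_feat) != len(fake_feat):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((r - f) ** 2 for r, f in zip(real_feat, fake_feat)))


def mean_distance(pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average the per-video distance over (real, generated) feature pairs."""
    distances = [similarity_distance(r, f) for r, f in pairs]
    if not distances:
        raise ValueError("at least one feature pair is required")
    return sum(distances) / len(distances)
```

Under these assumptions, a method would be scored by calling `mean_distance` on the feature pairs from its test videos, and a lower value would indicate generated videos whose lip, emotion, or blink features sit closer to the ground truth.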

Experimental results

Research questions

RQ1: What are the strengths and limitations of current evaluation metrics for talking-head generation?
RQ2: Which metrics should be preferred for each of the four desired properties, and can new metrics improve assessment?
RQ3: Are the proposed metrics robust to different testing protocols and datasets?
RQ4: What are the gaps in current methods regarding lip-sync and spontaneous motion that future work should address?

Key findings

Introduction of three new metrics for video-level evaluation: Lip-Reading Similarity Distance (LRSD), Emotion Similarity Distance (ESD), and Blink Similarity Distance (BSD).
Evidence that many methods struggle with head pose variation between reference and target frames and with accurate semantic lip movements for certain words.
Demonstration that LRSD aligns with human judgments and rankings of videos.
Observation that current models often yield limited spontaneous motion and struggle with naturalistic lip-sync under realistic head motion.
Provision of an open-source benchmark repository to standardize evaluation and facilitate cross-method comparisons.
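
The finding that LRSD aligns with human rankings can be checked, in principle, with a rank correlation between metric values and human scores. The review does not say which statistic the authors used, so the sketch below assumes Spearman's rank correlation and assumes there are no tied values. The helper names `ranks` and `spearman` are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Return the 1-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation for two equal-length, tie-free sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Closed-form Spearman formula, valid only when there are no ties.
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

Because a distance is lower for better videos while human ratings are usually higher for better videos, agreement would show up here as a correlation close to -1 between the metric values and the human scores.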

This review was created by AI and reviewed by human editors.