QUICK REVIEW

[Paper Review] What comprises a good talking-head video generation?: A Survey and Benchmark

Lele Chen, Guofeng Cui|arXiv (Cornell University)|May 7, 2020

Face recognition and analysis | References: 47 | Citations: 30

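ONE-LINE SUMMARY

The paper presents a survey and benchmark of identity-independent talking-head video generation, and proposes new perceptual metrics and a unified evaluation protocol that assess identity preservation, lip synchronization, visual quality, and natural motion.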

ABSTRACT

Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. As for evaluation, we either propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video, namely, identity preserving, lip synchronization, high video quality, and natural-spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: https://github.com/lelechen63/talking-head-generation-survey.

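Motivation and Objectives

Enumerate the desired properties (identity preservation, lip synchronization, visual quality, and natural motion) in order to define what makes a good talking-head video.
Critically examine existing evaluation metrics and identify their strengths and limitations for talking-head synthesis.
Provide standardized preprocessing and a benchmark design that enable reproducible evaluation.
Propose and validate new perceptual metrics that capture video-level quality and human-perceived similarity.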

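Proposed Method

Introduce four evaluation desiderata: identity preservation, lip synchronization, visual quality, and natural, spontaneous motion.
Survey and analyze existing metrics for identity preservation, visual quality, lip sync, and motion, and propose LRSD, ESD, and BSD as new video-level metrics.
Develop a unified preprocessing pipeline, including face tracking, cropping, and alignment, that enables cross-dataset evaluation.
Evaluate state-of-the-art identity-independent talking-head methods under various protocols to reveal their strengths and weaknesses.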
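
This review does not spell out how LRSD, ESD, and BSD are computed. As a rough sketch only, suppose each one compares feature embeddings extracted from a real video and its generated counterpart by some pretrained network (lip reading, emotion, or blink). The function names `cosine_distance` and `similarity_distance`, the choice of cosine distance, and the averaging over video pairs are all assumptions for illustration, not the paper's definition.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def similarity_distance(real_features, generated_features):
    """Mean cosine distance over paired (real, generated) video embeddings.

    Hypothetical stand-in for a video-level similarity distance; the
    embeddings are assumed to come from an external pretrained network.
    """
    if not real_features or len(real_features) != len(generated_features):
        raise ValueError("need the same non-zero number of real and generated videos")
    distances = [
        cosine_distance(real, generated)
        for real, generated in zip(real_features, generated_features)
    ]
    return sum(distances) / len(distances)
```

Under these assumptions, a lower value means the generated videos sit closer to the real ones in the chosen feature space; swapping the feature extractor would change which property (lip motion, emotion, blinking) the distance reflects.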

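Experimental Results

Research Questions

RQ1: What are the strengths and limitations of current evaluation metrics for talking-head generation?
RQ2: Which metrics should be prioritized for each of the four desired properties, and can new metrics improve evaluation?
RQ3: Are the proposed metrics robust across different testing protocols and datasets?
RQ4: What gaps in lip sync and spontaneous motion should future research address?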

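Key Findings

Introduces three new metrics for video-level evaluation: Lip-Reading Similarity Distance (LRSD), Emotion Similarity Distance (ESD), and Blink Similarity Distance (BSD).
Finds evidence that many methods struggle with head-pose variation between the reference and target frames and with the accuracy of semantic lip motion for specific words.
Shows that LRSD agrees with human evaluation and with video rankings.
Observes that current models exhibit limited spontaneous motion and can have difficulty achieving natural lip sync under realistic head movement.
Provides an open-source benchmark repository that standardizes evaluation and facilitates comparison across methods.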
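
One common way to check whether an automatic metric agrees with human rankings is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python; whether the paper uses this particular statistic is not stated in this review, so treat it as an illustrative assumption. The helper names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Hypothetical example: a distance metric (lower is better) against
# human ratings (higher is better) for three generated videos.
metric_scores = [0.1, 0.2, 0.3]
human_scores = [3, 2, 1]
correlation = spearman(metric_scores, human_scores)
```

Because a distance decreases as quality improves while a human rating increases, strong agreement shows up here as a correlation close to -1 rather than +1.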

This review was generated by AI and checked by a human editor.