QUICK REVIEW

[Paper Review] Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Fei Shen, Cong Wang | ArXiv.org | Feb 13, 2025

Speech and dialogue systems | Cited by 3

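One-Sentence Summary

This paper proposes the Motion-priors Conditional Diffusion Model (MCDM), which combines archived-clip and present-clip motion priors with a memory-efficient temporal attention to achieve robust identity preservation and long-term motion consistency in TalkingFace generation, and it releases the TalkingFace-Wild dataset.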

ABSTRACT

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.

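Motivation and Objectives

Address long-term identity preservation and motion consistency in TalkingFace generation.
Use historical (archived-clip) and current (present-clip) motion priors to inform diffusion-based generation.
Propose a memory-efficient temporal attention that mitigates error accumulation over long sequences.
Provide a high-quality multilingual TalkingFace dataset to support research in this area.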

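Proposed Method

Introduces a module that integrates archived-clip motion priors, aggregating long-term history through frame-aligned attention to strengthen identity context.
Proposes a present-clip motion-prior diffusion model that feeds current motion priors into the diffusion process and decouples and predicts head, lip, and expression motion through multimodal causality (audio, image, and landmarks when available).
Incorporates a diffusion-based denoising framework built on an L-layer transformer with FiLM-based multimodal conditioning and cross-attention over the archived and present priors.
Develops a memory-efficient temporal attention that updates a motion memory M_f and uses fast attention to reduce error accumulation over long sequences.
Trains in three stages, learning the archived priors, the present priors, and the full motion priors separately; landmarks are optional during conditioning.
Releases TalkingFace-Wild, a multilingual dataset of over 200 hours of footage across 10 languages, to support benchmarking and research.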
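The FiLM-based conditioning listed above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters predicted from a conditioning signal. The function name `film`, the weight matrices, and the toy values are assumptions made for illustration; this review does not say how MCDM computes its modulation parameters, so this is not the authors' implementation.

```python
def film(features, condition, w_gamma, w_beta):
    """Feature-wise linear modulation: out[c] = gamma[c] * features[c] + beta[c].

    features:  list of floats, one per channel.
    condition: list of floats (e.g. a hypothetical audio embedding).
    w_gamma, w_beta: hypothetical weight matrices with one row per channel,
                     mapping the condition to per-channel scale and shift.
    """
    def linear(weights, vec):
        # Plain matrix-vector product, one output per weight row.
        return [sum(w * v for w, v in zip(row, vec)) for row in weights]

    gamma = linear(w_gamma, condition)  # per-channel scale
    beta = linear(w_beta, condition)    # per-channel shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Toy example with two channels and a two-dimensional condition.
features = [1.0, 2.0]
condition = [1.0, 0.5]
w_gamma = [[1.0, 0.0], [0.0, 2.0]]
w_beta = [[0.0, 0.0], [1.0, 0.0]]
out = film(features, condition, w_gamma, w_beta)
```

In a real model the two linear maps would be learned layers applied inside each transformer block; here they are fixed matrices so the arithmetic stays easy to follow.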

Figure 1: Our MCDM architecture. On the upper, the archived-clip motion-prior leverages frame-aligned attention with archived-clip, enhancing identity coherence over extended sequences. On the right, the present-clip motion-prior diffusion model uses multimodal causality and temporal interactions to

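Experimental Results

Research Questions

RQ1: How do archived-clip and present-clip motion priors improve long-term TalkingFace generation?
RQ2: Can the memory-efficient temporal mechanism reduce error accumulation over long sequences without sacrificing realism?
RQ3: Does decoupling identity and motion with multimodal priors improve lip synchronization and expression realism in long videos?
RQ4: How do landmark guidance and audio-driven conditioning affect motion-prior diffusion?
RQ5: How does the proposed MCDM compare with existing methods on diverse, multilingual datasets?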

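Key Findings

On the standard HDTF and CelebV-HQ benchmarks, MCDM outperforms state-of-the-art diffusion-based and GAN-based methods on FID, FVD, Sync-C, Sync-D, SSIM, and E-FID.
On TalkingFace-Wild, MCDM records the best results among the listed methods: FID (26.45), FVD (543.28), Sync-C (7.84), Sync-D (8.04), SSIM (0.824), and E-FID (1.97).
MCDM achieves the strongest Sync-C with competitive Sync-D, suggesting improved lip synchronization and temporal consistency.
Ablation experiments show that removing the archived-clip information or the present-clip diffusion degrades identity, motion accuracy, and temporal stability.
The memory-efficient temporal attention suppresses error accumulation and improves long-term consistency compared with standard temporal attention.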

Figure 2: The overview of memory-efficient temporal attention. It can dynamically update and integrate historical motion features with current ones.
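The idea in Figure 2, a fixed-size motion memory that is updated clip by clip and attended to together with the current frames, can be sketched roughly as follows. This review does not give the update rule for the motion memory, so the exponential moving average, the `momentum` parameter, and every function name below are hypothetical choices made only to show the general pattern.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def update_memory(memory, clip_features, momentum=0.9):
    # Assumed update rule: exponential moving average of per-clip mean features.
    clip_mean = [sum(col) / len(clip_features) for col in zip(*clip_features)]
    if memory is None:
        return clip_mean
    return [momentum * m + (1 - momentum) * c for m, c in zip(memory, clip_mean)]


def generate_with_memory(clips, momentum=0.9):
    # Each frame attends to the memory slot plus the frames of its own clip,
    # so the attention context does not grow with the number of past clips.
    memory, outputs = None, []
    for clip in clips:
        context = clip if memory is None else [memory] + clip
        outputs.append([attend(frame, context, context) for frame in clip])
        memory = update_memory(memory, clip, momentum)
    return memory, outputs


# Two toy clips of 2-dimensional motion features.
clips = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
memory, outputs = generate_with_memory(clips, momentum=0.5)
```

The design point is that the memory keeps a constant size regardless of video length, which is what makes attending to history affordable for long sequences; the actual mechanism in the paper may store and combine features differently.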


This review was generated by AI and checked by a human editor.