QUICK REVIEW

[Paper Review] Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Fei Shen, Cong Wang | arXiv.org | 2025-02-13

Speech and dialogue systems | Citations: 3

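One-line summary

This paper proposes MCDM (Motion-Priors Conditional Diffusion Model), which combines archived-clip and present-clip motion priors with memory-efficient temporal attention to achieve robust identity preservation and long-term motion consistency in TalkingFace generation, and it releases the TalkingFace-Wild dataset.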

ABSTRACT

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.

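Research motivation and goals

Addresses long-term identity preservation and motion consistency in TalkingFace generation.
Uses historical (archived) and current (present) motion priors to guide diffusion-based generation.
Proposes a memory-efficient temporal attention mechanism to mitigate error accumulation over long sequences.
Provides a high-quality multilingual TalkingFace dataset to support research in this area.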

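Proposed method

Introduces an archived-clip motion-prior module that aggregates long-term history through frame-aligned attention and strengthens identity context.
Proposes a present-clip motion-prior diffusion model that disentangles and predicts head movement, lip motion, and expression using multimodal causality (audio, video, and landmarks where available).
Introduces a diffusion-based denoising framework built on an L-layer transformer with FiLM-based multimodal conditioning and cross-attention over the archived and present priors.
Implements memory-efficient temporal attention that updates a motion memory M_f and uses fast attention to reduce error accumulation over extended sequences.
Trains in three stages that learn the archived priors, the present priors, and the full motion priors separately; landmarks are an optional conditioning input.
Releases TalkingFace-Wild, a multilingual dataset of more than 200 hours across 10 languages, for benchmarking and research.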
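
The FiLM-based conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning vector. Everything concrete here is an assumption made for illustration, not taken from the paper: the function names (`linear`, `film`, `film_from_condition`), the single linear projection that produces the scale and shift, and the use of plain Python lists instead of tensors.

```python
from typing import List, Sequence

Vector = List[float]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float], x: Sequence[float]) -> Vector:
    """Dense layer: one output per weight row, computed as dot(row, x) + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Sequence[float], gamma: Sequence[float], beta: Sequence[float]) -> Vector:
    """FiLM: per-channel scale (gamma) and shift (beta) of the features."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def film_from_condition(features, cond, w_gamma, b_gamma, w_beta, b_beta) -> Vector:
    """Predict gamma and beta from a conditioning vector, then modulate.

    `cond` stands in for a hypothetical audio or landmark embedding; the
    projection weights are illustrative and would normally be learned.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return film(features, gamma, beta)


if __name__ == "__main__":
    motion_features = [1.0, 2.0]   # two feature channels
    audio_embedding = [1.0]        # one-dimensional toy condition
    out = film_from_condition(
        motion_features, audio_embedding,
        w_gamma=[[2.0], [0.5]], b_gamma=[0.0, 0.0],
        w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0],
    )
    print(out)
```

The appeal of this form of conditioning is that it adds only two vectors per modulated layer, so several modalities can each contribute their own scale and shift without changing the backbone.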

Figure 1: Our MCDM architecture. On the upper, the archived-clip motion-prior leverages frame-aligned attention with archived-clip, enhancing identity coherence over extended sequences. On the right, the present-clip motion-prior diffusion model uses multimodal causality and temporal interactions to

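Experimental results

Research questions

RQ1: How can archived-clip and present-clip motion priors improve long-term TalkingFace generation?
RQ2: Can a memory-efficient temporal mechanism reduce error accumulation in long diffusion-generated sequences without compromising realism?
RQ3: Does disentangling identity and motion through multimodal priors improve lip synchronization and expression realism in long videos?
RQ4: What is the effect of landmark guidance, and of audio-driven conditioning, in motion-prior diffusion?
RQ5: How does the proposed MCDM perform against existing methods across diverse multilingual datasets?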

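Key results

MCDM achieves strong quantitative performance on HDTF and CelebV-HQ, outperforming state-of-the-art diffusion-based and GAN-based methods on FID, FVD, Sync-C, Sync-D, SSIM, and E-FID on these standard benchmarks.
On TalkingFace-Wild, MCDM records the best FID (26.45), FVD (543.28), Sync-C (7.84), Sync-D (8.04), SSIM (0.824), and E-FID (1.97) among the listed methods.
MCDM achieves the strongest Sync-C and a competitive Sync-D, indicating improved lip synchronization and temporal consistency.
In ablation studies, removing the archived-clip information or the present-clip diffusion model degrades identity preservation, motion accuracy, and temporal stability.
Memory-efficient temporal attention reduces error accumulation and improves long-term consistency compared with standard temporal attention.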

Figure 2: The overview of memory-efficient temporal attention. It can dynamically update and integrate historical motion features with current ones.
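
The idea of integrating stored motion features with current ones can be sketched as follows. This is a toy illustration, not the paper's mechanism: the review does not give the actual update rule for M_f, so the exponential moving average in `update_memory`, the single-vector memory, and the function names are all assumptions. The only point carried over from the source is that a compact memory is updated as generation proceeds and is attended to alongside the current clip.

```python
import math
from typing import List, Optional

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def update_memory(memory: Optional[Vector], current: List[Vector], momentum: float = 0.9) -> Vector:
    """ASSUMED update rule: EMA of the per-clip mean motion feature."""
    dim = len(current[0])
    clip_mean = [sum(f[d] for f in current) / len(current) for d in range(dim)]
    if memory is None:
        return clip_mean
    return [momentum * m + (1.0 - momentum) * c for m, c in zip(memory, clip_mean)]


def temporal_attention_with_memory(current: List[Vector], memory: Optional[Vector]) -> List[Vector]:
    """Each current frame attends to the memory slot plus the current frames."""
    context = current if memory is None else [memory] + current
    return [attend(q, context, context) for q in current]


if __name__ == "__main__":
    memory = None
    clips = [[[1.0, 3.0], [3.0, 5.0]], [[4.0, 6.0]]]  # toy per-frame motion features
    for clip in clips:
        fused = temporal_attention_with_memory(clip, memory)
        memory = update_memory(memory, clip, momentum=0.5)
        print(len(fused), memory)
```

Because the memory in this sketch is a fixed-size vector, the attention context per clip stays at the clip length plus one regardless of how many clips came before, which is the sense in which such a design is memory-efficient.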


This review was generated by AI and reviewed by a human editor.