QUICK REVIEW

[Paper Review] EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

Ziqiao Peng, Haoyu Wu | arXiv (Cornell University) | 2023. 03. 20.

Face recognition and analysis | Citations: 9

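One-Line Summary

EmoTalk disentangles emotion from content in speech to drive 3D face animation, achieves richer emotional expression and better lip sync than prior methods, and introduces a large-scale 3D emotional talking face dataset (3D-ETF).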

ABSTRACT

Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels. Then an emotion-guided feature fusion decoder is employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotional, and content embeddings so as to generate controllable personal and emotional styles. Finally, considering the scarcity of the 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: https://ziqiaopeng.github.io/emotalk

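Motivation and Objectives

Motivates realistic speech-driven 3D face animation that includes emotional expression.
Disentangles speech content from emotion to improve expressiveness, so that emotional expression is enhanced without conflicting with the spoken content.
Ultimately provides an end-to-end trainable framework in which personal style and emotion intensity can be controlled.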

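Proposed Method

Introduces an emotion disentangling encoder (EDE) that uses two audio feature extractors to form a content latent space and an emotion latent space.
Enforces the disentanglement with a cross-reconstruction loss over mixed emotion-content pairs.
Develops an emotion-guided feature fusion decoder, built on Transformer-style attention, that maps the fused features to 52 blendshape coefficients.
Introduces a velocity loss and a classification loss to promote temporal stability and better emotion discrimination.
Builds the 3D-ETF dataset by deriving blendshape labels from 2D emotional datasets and applying blend skinning to obtain 3D meshes.
Trains and evaluates with 2D-to-3D supervision, made possible by the blendshape coefficients and compatibility with the FLAME model.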
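The cross-reconstruction loss and the velocity loss listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear `toy_decode` function, the embedding sizes, the random inputs, and the choice to reconstruct blendshape coefficient sequences are all hypothetical. Only the 52-coefficient output width comes from the summary above.

```python
import numpy as np

NUM_BLENDSHAPES = 52  # output width stated in the review


def velocity_loss(pred, target):
    """Mean squared error between frame-to-frame differences.

    pred, target: arrays of shape (T, NUM_BLENDSHAPES).
    """
    pred_vel = pred[1:] - pred[:-1]
    target_vel = target[1:] - target[:-1]
    return float(np.mean((pred_vel - target_vel) ** 2))


def cross_reconstruction_loss(decode, c1, e1, c2, e2, target_c1e2, target_c2e1):
    """Decode with swapped emotion embeddings and compare to the swapped targets."""
    rec_c1e2 = decode(c1, e2)  # content of clip 1, emotion of clip 2
    rec_c2e1 = decode(c2, e1)  # content of clip 2, emotion of clip 1
    return float(
        np.mean((rec_c1e2 - target_c1e2) ** 2)
        + np.mean((rec_c2e1 - target_c2e1) ** 2)
    )


# Hypothetical toy decoder: a fixed linear map from (content, emotion) to coefficients.
rng = np.random.default_rng(0)
T, CONTENT_DIM, EMOTION_DIM = 30, 16, 8  # illustrative sizes only
W_c = rng.normal(size=(CONTENT_DIM, NUM_BLENDSHAPES))
W_e = rng.normal(size=(EMOTION_DIM, NUM_BLENDSHAPES))


def toy_decode(content, emotion):
    # content: (T, CONTENT_DIM); emotion: (EMOTION_DIM,), broadcast over all frames
    return content @ W_c + emotion @ W_e


c1 = rng.normal(size=(T, CONTENT_DIM))
c2 = rng.normal(size=(T, CONTENT_DIM))
e1 = rng.normal(size=EMOTION_DIM)
e2 = rng.normal(size=EMOTION_DIM)

# Swapped-pair targets. Here they are synthesized with the same toy decoder;
# in a real setup they would come from recorded clips of the swapped pairing.
target_c1e2 = toy_decode(c1, e2)
target_c2e1 = toy_decode(c2, e1)

loss_swapped = cross_reconstruction_loss(
    toy_decode, c1, e1, c2, e2, target_c1e2, target_c2e1
)

# A constant offset changes positions but not velocities.
loss_velocity = velocity_loss(target_c1e2 + 0.5, target_c1e2)
```

The point of the swap is that the loss can only be driven down if the emotion embedding carries the emotion and the content embedding carries the content. The velocity term compares frame-to-frame changes rather than absolute positions, which is why a constant offset leaves it near zero.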

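Experimental Results

Research Questions

RQ1: Can emotion in speech be effectively disentangled from content to drive rich 3D face animation?
RQ2: Does emotion-guided fusion improve the expressiveness of 3D facial motion beyond lip-sync accuracy?
RQ3: Can pseudo-3D data derived from 2D emotional datasets support large-scale training of 3D emotional talking faces?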

Key Results

Dataset	Method	LVE (mm)	EVE (mm)
RAVDESS	VOCA	5.091	4.188
RAVDESS	MeshTalk	3.459	3.386
RAVDESS	FaceFormer	3.247	3.757
RAVDESS	Ours	2.762	2.493
HDTF	VOCA	4.447	3.286
HDTF	MeshTalk	3.886	3.124
HDTF	FaceFormer	3.374	3.142
HDTF	Ours	2.892	2.364
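To make the gaps in the table easier to compare, the following sketch computes the relative error reduction of "Ours" against each baseline. The numbers are copied from the table; the `relative_reduction` helper and the percentage framing are additions for illustration and are not reported in the review.

```python
# (LVE mm, EVE mm) per method, copied from the table above.
results = {
    "RAVDESS": {
        "VOCA": (5.091, 4.188),
        "MeshTalk": (3.459, 3.386),
        "FaceFormer": (3.247, 3.757),
        "Ours": (2.762, 2.493),
    },
    "HDTF": {
        "VOCA": (4.447, 3.286),
        "MeshTalk": (3.886, 3.124),
        "FaceFormer": (3.374, 3.142),
        "Ours": (2.892, 2.364),
    },
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


for dataset, methods in results.items():
    ours_lve, ours_eve = methods["Ours"]
    for name, (lve, eve) in methods.items():
        if name == "Ours":
            continue
        print(
            f"{dataset} vs {name}: "
            f"LVE -{relative_reduction(lve, ours_lve):.1f}%, "
            f"EVE -{relative_reduction(eve, ours_eve):.1f}%"
        )
```

For instance, on RAVDESS the LVE reduction relative to FaceFormer works out to roughly 15%.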

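EmoTalk achieves lower lip vertex error (LVE) and emotional vertex error (EVE) than state-of-the-art methods on the RAVDESS and HDTF datasets.
On RAVDESS, EmoTalk reaches an LVE of 2.762 mm and an EVE of 2.493 mm, outperforming VOCA (5.091, 4.188), MeshTalk (3.459, 3.386), and FaceFormer (3.247, 3.757).
On HDTF, EmoTalk reaches an LVE of 2.892 mm and an EVE of 2.364 mm, outperforming VOCA (4.447, 3.286), MeshTalk (3.886, 3.124), and FaceFormer (3.374, 3.142).
Zero-shot evaluation on the VOCA test set confirms strong generalization, with EmoTalk surpassing the baselines in lip-shape accuracy.
In the user study, EmoTalk was rated above MeshTalk and FaceFormer on full-face realism, lip sync, and emotional expression.
Ablations confirm the importance of the emotion disentangling encoder and the emotion-guided multi-head attention for emotional expression.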
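The review does not define LVE and EVE. The sketch below shows one common way such region-based vertex errors are computed in speech-driven animation work: take the maximum L2 error over a region's vertices in each frame, then average over frames. Whether the figures above were computed exactly this way is an assumption, and the vertex index sets and toy meshes here are hypothetical.

```python
import numpy as np


def region_vertex_error(pred, target, region_idx):
    """Per-frame maximum L2 vertex error within a region, averaged over frames.

    pred, target: arrays of shape (T, V, 3); region_idx: 1-D array of vertex indices.
    """
    diff = pred[:, region_idx] - target[:, region_idx]  # (T, R, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)          # (T, R)
    return float(per_vertex.max(axis=1).mean())


# Toy data: 4 frames, 10 vertices, ground truth at the origin.
T, V = 4, 10
target = np.zeros((T, V, 3))
pred = target.copy()
pred[:, 2, 0] = 3.0  # vertex 2 is displaced by (3, 4, 0) in every frame
pred[:, 2, 1] = 4.0
pred[:, 7, 2] = 1.0  # vertex 7 is displaced by (0, 0, 1) in every frame

lip_idx = np.array([1, 2, 3])    # hypothetical lip-region vertices
upper_idx = np.array([6, 7, 8])  # hypothetical upper-face (emotion) vertices

lve = region_vertex_error(pred, target, lip_idx)
eve = region_vertex_error(pred, target, upper_idx)
```

Using the per-frame maximum rather than the mean makes the metric sensitive to the single worst vertex in the region, so one badly placed lip vertex is not averaged away.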


This review was generated by AI and reviewed by a human editor.