QUICK REVIEW

[Paper Review] Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Mingwang Xu, Hui Li|arXiv (Cornell University)|Jun 13, 2024

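Music and Audio Processing|Citations: 5

One-Sentence Summary

This paper presents Hallo, an end-to-end diffusion model that animates portraits through hierarchical audio-driven visual synthesis, aligning lip, expression, and pose motion with the input audio. It uses a ReferenceNet and temporal alignment to achieve high fidelity and diverse motion.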

ABSTRACT

The field of portrait image animation, driven by speech audio input, has experienced significant advancements in the generation of realistic and dynamic portraits. This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations within the framework of diffusion-based methodologies. Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion. Our proposed network architecture seamlessly integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities. Through a comprehensive evaluation that incorporates both qualitative and quantitative analyses, our approach demonstrates obvious enhancements in image and video quality, lip synchronization precision, and motion diversity. Further visualization and access to the source code can be found at: https://fudan-generative-vision.github.io/hallo.

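Motivation and Objectives

Advance portrait image animation by removing intermediate facial representations and achieving high realism with end-to-end diffusion.
Achieve audio-visual alignment across lip, expression, and head-pose motion through hierarchical cross-attention.
Enable adaptive control over expression and pose diversity tailored to individual identities.
Improve the temporal consistency and visual fidelity of talking-head videos driven by speech audio.

Proposed Method

Uses a latent diffusion model with a UNet-based denoiser in an end-to-end framework.
Introduces hierarchical audio-driven visual synthesis based on cross-attention between audio features and lip, expression, and pose features.
Encodes identity with a face encoder and audio with wav2vec, then fuses them through hierarchical cross-attention.
Incorporates a ReferenceNet, which guides generation using the reference image, along with temporal alignment to keep the video consistent.
Uses mask-based attention over the lip, expression, and pose regions, plus adaptive fusion, to drive motion from audio.
Trains in two stages: single-frame generation with fixed encoders, then video-sequence training with hierarchical audio-visual cross-attention.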
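
The mask-based attention and adaptive fusion described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names `cross_attention` and `hierarchical_fusion`, the per-position binary masks, and the scalar per-region weights are all assumptions made for illustration, and the actual model operates on latent feature maps with learned modules that this review does not detail.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each visual query attends to audio tokens."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def hierarchical_fusion(visual, audio, masks, weights):
    """Mask audio-attended features per region, then combine with per-region weights.

    masks[name][i] is 1 if visual position i belongs to that region, else 0
    (hypothetical layout). weights[name] is a hypothetical scalar control.
    """
    attended = cross_attention(visual, audio, audio)
    fused = []
    for i, feat in enumerate(attended):
        scale = sum(weights[name] * masks[name][i] for name in masks)
        fused.append([scale * a for a in feat])
    return fused

# Toy data: 3 visual positions, 2 audio tokens, feature dimension 2.
VISUAL = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
AUDIO = [[0.5, 0.5], [1.0, -1.0]]
MASKS = {"lip": [1, 0, 0], "expression": [0, 1, 0], "pose": [0, 0, 1]}
WEIGHTS = {"lip": 1.0, "expression": 0.5, "pose": 0.0}

ATTENDED = cross_attention(VISUAL, AUDIO, AUDIO)
FUSED = hierarchical_fusion(VISUAL, AUDIO, MASKS, WEIGHTS)
print(FUSED)
```

With these toy weights, the lip position keeps its full audio-attended feature, the expression position is halved, and the pose position is suppressed, which mirrors the idea of balancing lip, expression, and pose influence.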

Figure 1 : The proposed methodology aims to generate portrait image animations that are temporally consistent and visually high-fidelity. This is achieved by utilizing a reference image, an audio sequence, and optionally, the visual synthesis weight in conjunction with a diffusion model based on the

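Experimental Results

Research Questions

RQ1: How can tight lip synchronization, realistic expressions, and natural head motion be achieved in an audio-driven, diffusion-based talking-head pipeline?
RQ2: Can hierarchical audio-visual cross-attention improve the alignment between audio input and visual lip, eye, and mouth motion across identities?
RQ3: How do reference guidance and temporal alignment affect the fidelity and consistency of generated portrait videos?

Key Findings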

Method (Dataset)	FID ↓	FVD ↓	Sync-C ↑	Sync-D ↓	E-FID ↓
SadTalker (HDTF)	22.340	203.860	7.885	7.545	9.776
Audio2Head (HDTF)	37.776	239.860	8.024	7.145	17.103
DreamTalk (HDTF)	78.147	790.660	6.376	8.364	15.696
AniPortrait (HDTF)	26.561	234.666	4.015	10.548	13.754
Ours (HDTF)	20.545	173.497	7.750	7.659	7.951
SadTalker (CelebV)	50.015	471.163	6.922	7.921	95.194
Audio2Head (CelebV)	84.793	457.499	8.024	7.145	153.618
DreamTalk (CelebV)	109.011	988.539	5.709	8.743	153.450
AniPortrait (CelebV)	46.915	477.179	2.853	11.709	88.986
Ours (CelebV)	44.578	377.117	7.191	7.984	78.495
SadTalker (Wild)	24.212	249.786	6.613	8.099	37.324
Audio2Head (Wild)	61.510	383.178	5.719	8.585	66.116
DreamTalk (Wild)	128.423	964.088	5.925	8.596	58.180
AniPortrait (Wild)	24.118	250.770	3.043	10.997	37.806
Ours (Wild)	23.266	239.647	6.924	7.969	34.731
Full (Ablation)	20.581	193.062	6.499	8.691	9.133

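Across multiple datasets, Hallo achieves lower FID and FVD than the listed baselines, and its lip-sync metrics are competitive with or better than those of several baselines.
The method shows strong lip synchronization (Sync-C) and acceptable Sync-D, along with marked improvements in temporal consistency (FVD) and image fidelity (FID).
Ablations that incorporate lip, expression, and pose cross-attention improve or maintain overall quality and synchronization, and the full hierarchical configuration achieves the best results.
Supports personalization through identity-specific refinement and maintains high-quality output across diverse portrait styles and audio inputs.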
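
The comparison claims above can be checked directly against the first group of rows in the table. The snippet below is only a convenience sketch: the values are copied from those rows, while the names `HDTF_ROWS`, `METRICS`, and `best_per_metric` are hypothetical helpers introduced here.

```python
# Values copied from the first (HDTF) group of the comparison table.
# Column order: FID, FVD, Sync-C, Sync-D, E-FID.
HDTF_ROWS = {
    "SadTalker": (22.340, 203.860, 7.885, 7.545, 9.776),
    "Audio2Head": (37.776, 239.860, 8.024, 7.145, 17.103),
    "DreamTalk": (78.147, 790.660, 6.376, 8.364, 15.696),
    "AniPortrait": (26.561, 234.666, 4.015, 10.548, 13.754),
    "Ours": (20.545, 173.497, 7.750, 7.659, 7.951),
}

# (metric name, True if higher is better), following the arrows in the header.
METRICS = [
    ("FID", False),
    ("FVD", False),
    ("Sync-C", True),
    ("Sync-D", False),
    ("E-FID", False),
]

def best_per_metric(rows, metrics):
    """Return the best-scoring method for each metric, respecting its direction."""
    best = {}
    for idx, (name, higher_is_better) in enumerate(metrics):
        pick = max if higher_is_better else min
        best[name] = pick(rows, key=lambda method: rows[method][idx])
    return best

BEST = best_per_metric(HDTF_ROWS, METRICS)
for metric, method in BEST.items():
    print(f"{metric}: {method}")
```

On these rows the proposed method leads on FID, FVD, and E-FID, while Audio2Head has the best Sync-C and Sync-D, which is consistent with describing the lip-sync results as competitive rather than uniformly best.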

Figure 2 : The overview of the proposed pipeline. Specifically, we integrate a reference image containing a portrait with corresponding audio input to drive portrait animation. Optional visual synthesis weights can be used to balance lip, expression, and pose weights. ReferenceNet encodes global vi

This review was generated by AI and checked by a human editor.