QUICK REVIEW

[Paper Review] Arc2Face: A Foundation Model for ID-Consistent Human Faces

Foivos Paraperas Papantoniou, Alexandros Lattas|arXiv (Cornell University)|Mar 18, 2024

Face recognition and analysis | Citations: 5

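One-line summary

Arc2Face is an identity-conditioned foundation model that generates photo-realistic faces from ArcFace embeddings. It uses no text prompts: by upsampling WebFace42M and fine-tuning Stable Diffusion, it achieves strong ID preservation together with diverse outputs.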

ABSTRACT

This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with an unparalleled degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize on the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train a FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.

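Motivation and objectives

Motivate the need for high-resolution, robust ID-conditioned face generation.
Develop a foundation model that uses the ArcFace embedding as its only conditioning signal.
Show that large-scale FR data (WebFace42M) is essential for training generative models that preserve identity.
Show that the model achieves strong ID fidelity and realistic diversity without text prompts.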

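Proposed method

Condition Stable Diffusion on ArcFace embeddings through a fine-tuned encoder that projects the ArcFace vector into the CLIP latent space.
Build a large, high-quality training set by restoring WebFace42M images with GFPGAN and upsampling them to 448x448, then fine-tune on FFHQ and CelebA-HQ for 512x512 output.
Train on 21 million restored WebFace42M images, then fine-tune on FFHQ and CelebA-HQ within a standard latent diffusion (LD) framework.
Use the simple pseudo-prompt “photo of a <id> person”, replacing the <id> token with the ArcFace embedding to guide generation.
Evaluate ID fidelity between the input ID and the generated face with ArcFace cosine similarity, and evaluate diversity and image quality with LPIPS, expression/pose distances, and FID.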
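
The pseudo-prompt step above can be illustrated with a minimal sketch. It is not the authors' implementation: the 768-wide token embeddings, the 512-dimensional ID vector, the zero-padding used to match the two widths, and the helper name `inject_id_embedding` are all assumptions made for illustration. The sketch shows only the substitution itself, where the embedding at the `<id>` position is swapped for the ID vector before the sequence would be handed to a text encoder.

```python
import random

TOKEN_DIM = 768  # assumed token-embedding width (hypothetical)
ID_DIM = 512     # assumed ID-embedding size (hypothetical)


def inject_id_embedding(tokens, token_embeddings, id_embedding, placeholder="<id>"):
    """Return a copy of token_embeddings whose placeholder row is replaced
    by the zero-padded ID embedding."""
    if len(tokens) != len(token_embeddings):
        raise ValueError("tokens and token_embeddings must have the same length")
    if len(id_embedding) > TOKEN_DIM:
        raise ValueError("ID embedding is wider than the token embedding")
    # Zero-pad the ID vector up to the token width (an assumption of this sketch).
    padded = list(id_embedding) + [0.0] * (TOKEN_DIM - len(id_embedding))
    position = tokens.index(placeholder)  # raises ValueError if the placeholder is absent
    out = [list(row) for row in token_embeddings]  # copy so the input stays untouched
    out[position] = padded
    return out


rng = random.Random(0)
tokens = ["photo", "of", "a", "<id>", "person"]
# Stand-in token embeddings; a real pipeline would look these up in the text encoder.
token_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)] for _ in tokens]
id_embedding = [rng.gauss(0.0, 1.0) for _ in range(ID_DIM)]

conditioned = inject_id_embedding(tokens, token_embeddings, id_embedding)
```

Because every other token in the prompt is fixed, the ID vector is the only part of the conditioning sequence that changes between subjects.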

Figure 1: Given the ID-embedding from [14], Arc2Face can generate high-quality images of any subject with compelling similarity. Using popular extensions, such as ControlNet [96], we can explicitly control facial attributes such as the pose or expression.

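Experimental results

Research questions

RQ1: Given only an expanded, high-resolution FR dataset, can an ID embedding alone (ArcFace) sufficiently constrain a diffusion model for high-resolution face generation?
RQ2: How does training on a very large, high-resolution FR dataset affect ID preservation and image realism?
RQ3: How does Arc2Face compare with CLIP-based and text-based conditioning methods at producing diverse outputs while preserving identity?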

Key findings

Dataset / Method	LPIPS ↑	Expression (ℓ2) ↑	Pose (ℓ2) ↑	FID ↓
Synth-500 FastComposer	0.389	3.597	0.163	13.517
AgeDB FastComposer	0.487	4.678	0.225	31.736
Synth-500 Photomaker	0.410	3.920	0.167	13.295
AgeDB Photomaker	0.424	4.283	0.165	8.410
Synth-500 InstantID	0.386	3.733	0.059	22.859
AgeDB InstantID	0.437	4.569	0.082	18.598
Synth-500 IPA-FaceID (SDXL)	0.402	4.648	0.181	7.104
AgeDB IPA-FaceID (SDXL)	0.462	5.812	0.197	24.105
Synth-500 IPA-FaceID-Plus	0.320	2.706	0.150	14.880
AgeDB IPA-FaceID-Plus	0.384	3.518	0.194	11.817
Synth-500 IPA-FaceID-Plusv2	0.356	3.147	0.185	9.752
AgeDB IPA-FaceID-Plusv2	0.429	4.092	0.236	10.798
Synth-500 Arc2Face (Ours)	0.506	6.375	0.317	5.673
AgeDB Arc2Face (Ours)	0.508	5.966	0.273	6.628
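
The expression and pose columns report ℓ2 distances, where larger values indicate more varied outputs. One plausible way to obtain such a score, assumed here rather than taken from the paper, is the mean pairwise ℓ2 distance between parameter vectors estimated from several images generated for the same identity. The function name and the toy pose vectors below are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_l2(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy pose parameters (hypothetical values) for three generated images of one identity.
varied_poses = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.4, 0.0)]
similar_poses = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.01, 0.0)]

varied_score = mean_pairwise_l2(varied_poses)    # larger: poses differ across samples
similar_score = mean_pairwise_l2(similar_poses)  # smaller: samples are near-identical
```

Under this reading, a method that keeps reproducing the reference pose would score low in the pose column even if its identity similarity were high.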

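Arc2Face achieves high identity similarity between the input ArcFace embedding and the generated face, outperforming CLIP-based methods on ID preservation.
It delivers strong ID fidelity together with diversity in pose and expression, without text prompts.
Training on WebFace42M (upsampled to high resolution) improves identity preservation substantially over training on FFHQ alone, underlining the need for million-scale FR data.
Combined with ControlNet, Arc2Face can control pose and expression through 3DMM-derived normal maps, enabling pose- and expression-aware synthesis.
In the synthetic-data experiment, an FR model trained on Arc2Face-generated images is competitive with, and in some cases better than, models trained on existing synthetic datasets on standard benchmarks (e.g., LFW, CFP-FP, CPLFW, AgeDB, CALFW).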
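
The identity-similarity comparison rests on cosine similarity between the input ID embedding and the embedding extracted from a generated face. A minimal sketch of that computation follows; the short vectors stand in for real ArcFace embeddings, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (hypothetical values).
input_id = [0.6, 0.8, 0.0]
same_direction = [1.2, 1.6, 0.0]  # scaled copy: same identity direction
orthogonal = [0.0, 0.0, 1.0]      # unrelated direction

sim_same = cosine_similarity(input_id, same_direction)
sim_other = cosine_similarity(input_id, orthogonal)
```

The score depends only on direction, not magnitude, so a generated face whose embedding is a scaled copy of the input embedding still counts as a perfect match.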

Figure 2 : Overview of Arc2Face. We use a straightforward design to condition Stable Diffusion on ID features. The ArcFace embedding is processed by the text encoder using a frozen pseudo-prompt for compatibility, allowing projection into the CLIP latent space for cross-attention control. Both the e

This review was generated by AI and checked by a human editor.