QUICK REVIEW

[Paper Review] SDFR: Synthetic Data for Face Recognition Competition

Hatef Otroshi Shahreza, Christophe Ecabert|arXiv (Cornell University)|Apr 6, 2024

Face recognition and analysis | Cited by 7

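ONE-SENTENCE SUMMARY

This paper summarizes the SDFR competition, which evaluated face recognition models trained on synthetic data in two tasks with different constraints, and analyzes the results, demographic bias, and future directions.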

ABSTRACT

Large-scale face recognition datasets are collected by crawling the Internet and without individuals' consent, raising legal, ethical, and privacy concerns. With the recent advances in generative models, recently several works proposed generating synthetic face recognition datasets to mitigate concerns in web-crawled face recognition datasets. This paper presents the summary of the Synthetic Data for Face Recognition (SDFR) Competition held in conjunction with the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024) and established to investigate the use of synthetic data for training face recognition models. The SDFR competition was split into two tasks, allowing participants to train face recognition systems using new synthetic datasets and/or existing ones. In the first task, the face recognition backbone was fixed and the dataset size was limited, while the second task provided almost complete freedom on the model backbone, the dataset, and the training pipeline. The submitted models were trained on existing and also new synthetic datasets and used clever methods to improve training with synthetic data. The submissions were evaluated and ranked on a diverse set of seven benchmarking datasets. The paper gives an overview of the submitted face recognition models and reports achieved performance compared to baseline models trained on real and synthetic datasets. Furthermore, the evaluation of submissions is extended to bias assessment across different demography groups. Lastly, an outlook on the current state of the research in training face recognition models using synthetic data is presented, and existing problems as well as potential future directions are also discussed.

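RESEARCH MOTIVATION AND OBJECTIVES

Promote privacy-friendly alternatives to web-crawled face datasets, motivated by legal and ethical concerns.
Evaluate whether synthetic data can train competitive face recognition models in constrained and unconstrained settings.
Assess bias and fairness across demographic groups when training on synthetic data.
Provide insights and recommendations for improving research on face recognition with synthetic data.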

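PROPOSED METHOD

Two tasks: Task 1 uses a fixed backbone (iResNet-50) and at most 1M synthetic images; Task 2 leaves the backbone and dataset unconstrained, with training on synthetic data only.
Participants may use existing synthetic datasets (e.g., IDiff-Face, DigiFace) or generate new ones; the rules exclude identity-labeled web data for training the generator.
Models are submitted in ONNX format and evaluated on seven benchmark datasets (LFW, CFP-FP, CPLFW, AgeDB-30, CALFW, IJB-B, IJB-C).
The final ranking is produced by applying a Borda count to the per-dataset rankings.
Data augmentation, loss functions (AdaFace, ArcFace, and ArcFace-like losses), and pose/quality augmentation to bridge the synthetic-to-real gap are key to improving performance.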
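
The Borda-count ranking step can be illustrated with a minimal sketch. The point scheme used here (the worst team on a benchmark earns 0 points, the best earns the number of teams minus one, and ties get no special treatment) is an assumption, and the function names `borda_points` and `borda_ranking` are hypothetical. The scores are the two Task 1 leaderboard rows reported later in this review; the actual competition ranking was computed over all submissions, not just these two.

```python
from typing import Dict, List

# Scores copied from the two Task 1 leaderboard rows reported in this review,
# in the order LFW, CALFW, CPLFW, AgeDB-30, CFP-FP, IJB-B, IJB-C.
scores: Dict[str, List[float]] = {
    "BioLab": [96.97, 89.12, 76.80, 83.77, 77.34, 60.21, 63.56],
    "BOVIFOCR-UFPR": [97.53, 89.38, 80.07, 83.90, 84.37, 12.70, 13.71],
}


def borda_points(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Sum Borda points over all benchmarks (higher score earns more points).

    Assumption: on each benchmark the worst team gets 0 points and the best
    gets (number of teams - 1). Ties are not handled specially.
    """
    teams = list(scores)
    n_benchmarks = len(scores[teams[0]])
    points = {team: 0 for team in teams}
    for b in range(n_benchmarks):
        # Order teams from worst to best on benchmark b.
        ordered = sorted(teams, key=lambda team: scores[team][b])
        for position, team in enumerate(ordered):
            points[team] += position
    return points


def borda_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Return team names sorted from most to fewest Borda points."""
    points = borda_points(scores)
    return sorted(points, key=lambda team: points[team], reverse=True)


points = borda_points(scores)
ranking = borda_ranking(scores)
print(points)   # {'BioLab': 2, 'BOVIFOCR-UFPR': 5}
print(ranking)  # ['BOVIFOCR-UFPR', 'BioLab']
```

Restricted to these two rows, BOVIFOCR-UFPR leads on five of the seven benchmarks and therefore places ahead of BioLab despite much lower IJB-B and IJB-C scores. A rank-based aggregate ignores the size of each margin, which is consistent with the ranks 3 and 4 listed for these teams.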

Figure 1 : Sample original and augmented face images from the IDiff-Face (Uniform) dataset obtained by synthetically rotating face pose in the yaw axis used by the BOVIFOCR-UFPR team in task 2.

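EXPERIMENTAL RESULTS

Research Questions

RQ1: Can synthetic face datasets, in constrained and unconstrained settings, reach face recognition performance close to that of training on real data?
RQ2: What are the benefits and limitations of training on multiple synthetic data sources (e.g., IDiff-Face, DigiFace, StyleGAN2/3)?
RQ3: Compared with real-data baselines, how does training on synthetic data affect bias across demographic groups (race, ethnicity)?
RQ4: Which data augmentation and dataset curation strategies most effectively narrow the performance gap between synthetic and real data?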

Key Findings

Task / Dataset	Method / Team	No. Images	LFW	CALFW	CPLFW	AgeDB-30	CFP-FP	IJB-B	IJB-C	Rank
Task 1 Baselines	MS-Celeb (real)	5.8M	99.82	95.92	92.52	97.62	96.01	94.88	96.23	-
Task 1 Baselines	WebFace-4M (real)	4M	99.78	96.02	93.90	97.52	97.19	95.52	97.02	-
Task 1 Baselines	DigiFace (synthetic)	1.2M	90.63	74.02	71.38	65.03	78.11	38.89	45.09	-
Task 1 Leaderboard	BioLab	500K	96.97	89.12	76.80	83.77	77.34	60.21	63.56	4
Task 1 Leaderboard	BOVIFOCR-UFPR	500K	97.53	89.38	80.07	83.90	84.37	12.70	13.71	3
Task 2 Leaderboard	BioLab	1.7M	98.33	90.87	84.45	87.85	88.11	76.94	81.25	1
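
The size of the synthetic-to-real gap can be read directly off the table. The short sketch below subtracts the top-ranked Task 2 row (BioLab, 1.7M synthetic images) from the WebFace-4M real-data baseline, benchmark by benchmark. Pairing these two rows is a choice made here for illustration: the baseline is listed under Task 1, so the two models may differ in backbone as well as in training data.

```python
# Values copied from the table above; column order follows the table header.
BENCHMARKS = ["LFW", "CALFW", "CPLFW", "AgeDB-30", "CFP-FP", "IJB-B", "IJB-C"]
webface_4m_real = [99.78, 96.02, 93.90, 97.52, 97.19, 95.52, 97.02]
biolab_task2_synth = [98.33, 90.87, 84.45, 87.85, 88.11, 76.94, 81.25]

# Gap in percentage points: real-data baseline minus synthetic-data submission.
gaps = {
    name: round(real - synth, 2)
    for name, real, synth in zip(BENCHMARKS, webface_4m_real, biolab_task2_synth)
}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:9s} {gap:6.2f}")
print("largest gap:", largest)
```

For these rows the gap is 1.45 points on LFW and grows to 18.58 on IJB-B and 15.77 on IJB-C, with CPLFW, AgeDB-30, and CFP-FP each around 9 to 10 points.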

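Submissions trained on synthetic data improve over the synthetic-only baseline, but a gap to models trained on real web-crawled datasets remains, especially on the challenging IJB-B and IJB-C benchmarks.
The BioLab Task 2 submission, which combines the DigiFace and IDiff-Face synthetic datasets, performs strongly on both standard and challenging benchmarks.
Task 2 received fewer submissions, suggesting that training with unconstrained synthetic data is the harder setting.
The RFW-based bias analysis shows that all submissions perform best on the Caucasian group and worse on the African group, highlighting bias in synthetic-data training.
Diverse augmentation and data generation strategies (e.g., pose variation, quality-based selection, mixing synthetic sources) are essential to narrowing the synthetic-to-real gap.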
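
One simple way to summarize a demographic bias assessment of this kind is to report the mean, the standard deviation, and the best-to-worst spread of verification accuracy across the four RFW groups. The sketch below shows that arithmetic. The accuracy values are HYPOTHETICAL placeholders chosen only to mirror the ordering described above; they are not results from the paper, and the helper name `bias_summary` is an assumption, as is the choice of these particular summary statistics.

```python
from statistics import mean, pstdev
from typing import Dict

# HYPOTHETICAL per-group verification accuracies (%), for illustration only.
# These are NOT numbers reported by the SDFR paper.
rfw_accuracy: Dict[str, float] = {
    "Caucasian": 90.0,
    "Asian": 86.0,
    "Indian": 88.0,
    "African": 84.0,
}


def bias_summary(accuracy: Dict[str, float]) -> Dict[str, object]:
    """Summarize how evenly accuracy is spread across demographic groups."""
    values = list(accuracy.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),               # population std across groups
        "max_gap": max(values) - min(values),
        "best_group": max(accuracy, key=accuracy.get),
        "worst_group": min(accuracy, key=accuracy.get),
    }


summary = bias_summary(rfw_accuracy)
print(summary)
```

A lower standard deviation and a smaller maximum gap indicate more uniform performance across groups; the mean alone can hide a large disparity between the best and worst group.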


This review was generated by AI and reviewed by a human editor.