QUICK REVIEW

[Paper Review] iQIYI-VID: A Large Dataset for Multi-modal Person Identification

Yuanliu Liu, Bo Peng|arXiv (Cornell University)|Nov 19, 2018

Face recognition and analysis | References: 79 | Cited by: 25

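One-Sentence Summary

This paper introduces iQIYI-VID, currently the largest video dataset for multi-modal person identification, containing 600K video clips of 5,000 celebrities drawn from diverse online videos. It proposes a Multi-modal Attention (MMA) module that adaptively fuses face, head, body, and audio features, raising person identification accuracy on the benchmark by 2.61 percentage points over the single-modal baseline to reach 87.80% MAP.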

ABSTRACT

Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modal of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising way that we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%. We evaluated the state-of-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from being perfect for the task of person identification in the wild. We proposed a Multi-modal Attention module to fuse multi-modal features that can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.

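Research Motivation and Objectives

Address the limitations of single-modal person identification methods (face, audio, Re-ID) in unconstrained, real-world video.
Build a large-scale, high-quality video dataset to support research on multi-modal person identification.
Develop a learnable feature fusion mechanism that adaptively combines multi-modal features based on the correlations between modalities.
Evaluate state-of-the-art models on a challenging real-world benchmark and demonstrate the need for multi-modal fusion.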

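Proposed Method

The iQIYI-VID dataset is built from 400,000 hours of diverse online videos (including movies, TV series, and news) and contains 600K video clips of 5,000 celebrities.
All clips are manually annotated, with a label error rate below 0.2%, which makes the labels reliable enough for benchmarking.
The Multi-modal Attention (MMA) module learns attention weights for the face, head, body, and audio features based on the correlations between modalities.
By dynamically reweighting each feature's contribution, the MMA module suppresses inconsistent or unreliable features (such as occluded faces or audio without speech).
The baseline models extract face features with ArcFace, use NetVLAD for mid-level feature aggregation, and fuse frame-level features by average pooling.
An ensemble strategy combines models trained on different data splits and averages their predicted probabilities to improve performance.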
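
The fusion steps above can be illustrated with a minimal sketch. It is not the paper's MMA module, which learns its attention weights during training. Here the weights are instead computed directly from each modality's mean cosine similarity to the other modalities, so a modality that disagrees with the rest is down-weighted. All function names are hypothetical, the feature vectors are toy values, and the sketch assumes every modality has already been projected to a common dimension.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors (e.g. per-frame features)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features):
    """Fuse modality features using correlation-derived attention weights.

    features: dict of modality name -> vector (needs at least two modalities).
    Assumption: weights come from a fixed similarity rule, not a learned layer.
    """
    names = list(features)
    unit = {name: l2_normalize(features[name]) for name in names}
    scores = []
    for name in names:
        # Mean cosine similarity between this modality and every other one.
        sims = [sum(a * b for a, b in zip(unit[name], unit[other]))
                for other in names if other != name]
        scores.append(sum(sims) / len(sims))
    weights = softmax(scores)
    dim = len(unit[names[0]])
    fused = [sum(w * unit[name][i] for w, name in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

def ensemble_average(prob_vectors):
    """Average class-probability vectors predicted by several models."""
    return mean_pool(prob_vectors)

# Toy clip: the audio feature disagrees with the three visual modalities.
clip_features = {
    "face": mean_pool([[1.0, 0.0], [1.0, 0.0]]),  # two frames pooled into one vector
    "head": [0.9, 0.1],
    "body": [0.8, 0.2],
    "audio": [0.0, 1.0],
}
fused, weights = attention_fuse(clip_features)
print(weights)  # audio receives the smallest weight
```

In this toy clip the audio vector is nearly orthogonal to the visual ones, so it gets the lowest weight, which mirrors the idea of suppressing non-speaking audio. A learned module would replace the fixed similarity rule with trainable parameters.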

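Experimental Results

Research Questions

RQ1: How does multi-modal feature fusion improve person identification in unconstrained video compared with single-modal methods?
RQ2: What are the relative contributions of face, head, body, and audio features to person identification in real-world video clips?
RQ3: Can a learnable attention mechanism effectively suppress noisy or inconsistent modality features during fusion?
RQ4: How does the proposed Multi-modal Attention module compare with conventional fusion methods such as average pooling or concatenation?
RQ5: To what extent does the iQIYI-VID dataset challenge existing state-of-the-art person identification models?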

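Key Findings

A face-only model reaches 85.19% MAP on iQIYI-VID, well below the 99.83% it achieves on the LFW dataset, indicating that this dataset reflects greater real-world complexity.
An audio-only model performs poorly, with a MAP of only 11.79%, mainly because of clips without speech and mismatches introduced by voice actors (dubbing).
Body features perform poorly, mainly because clothing changes cause large intra-class variation and because uniforms make different people look alike.
Combining all four modalities (face, head, body, audio) raises MAP by 2.61 percentage points to 87.80%, demonstrating the value of multi-modal fusion.
The Multi-modal Attention (MMA) module adds a further 0.24 percentage points over standard fusion methods, showing that it is effective at suppressing unreliable features.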
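
The findings above are all reported as MAP (mean average precision). As a minimal sketch, assuming the standard retrieval definition (the exact evaluation protocol is not described in this review), MAP can be computed from a ranked list of clips per identity, where each entry records whether the retrieved clip truly belongs to that identity. The function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision for one identity.

    ranked_relevance: list of booleans, ordered from the highest-scored clip
    to the lowest, True where the clip belongs to the queried identity.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-identity average precision values."""
    aps = [average_precision(r) for r in rankings.values()]
    return sum(aps) / len(aps)

# Toy rankings for two hypothetical identities.
rankings = {
    "person_a": [True, False, True],  # positives at ranks 1 and 3
    "person_b": [False, True],        # single positive at rank 2
}
map_score = mean_average_precision(rankings)
print(round(map_score, 4))
```

Because average precision rewards placing correct clips near the top of each ranking, a gain of a few percentage points in MAP means that correct clips moved noticeably higher across many identities.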

This review was generated by AI and checked by human editors.