QUICK REVIEW

[Paper Review] Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace

Dimitrios Kollias, Stefanos Zafeiriou | arXiv (Cornell University) | Jun 10, 2019

Emotion and Mood Recognition | Cited by 122

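One-Sentence Summary

The authors present Aff-Wild2, a large-scale in-the-wild audiovisual dataset annotated for valence-arousal (VA), facial action units (AUs), and basic expressions, and show that learning pipelines based on multi-task learning and ArcFace achieve state-of-the-art results on multiple emotion recognition databases.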

ABSTRACT

Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models, as the default approach to address any computer vision task. Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function in the emotion recognition context and use it for training two new networks on Aff-Wild2 and then re-train them in a variety of diverse expression recognition databases. The networks are shown to improve the existing state-of-the-art. The database, emotion recognition models and source code are available at http://ibug.doc.ic.ac.uk/resources/aff-wild2.

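Research Motivation and Objectives

Motivation: large-scale, diverse in-the-wild datasets annotated for VA, AUs, and expressions are needed.
Extend Aff-Wild into Aff-Wild2 by adding VA annotations as well as AU and expression (Expr) annotations, enabling joint analysis of all three.
Develop multi-task CNN and CNN-RNN architectures (including audiovisual fusion) that are trained on Aff-Wild2 and evaluated on 10 external databases.
Study the effectiveness of the ArcFace loss for emotion recognition by training ArcFace-based networks on Aff-Wild2 and re-training them on multiple expression databases.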

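Proposed Method

Introduce three preprocessing streams for the visual (face crops) and audio (spectrograms) modalities.
Train single-task and multi-task CNNs (based on SphereFace-20, VGGFace, and Inception-ResNet), then extend them to multi-task CNN-RNNs and audiovisual fusion (A/V-MT-VGG-RNN).
Use standard losses for multi-task learning: cross-entropy for expression recognition, binary cross-entropy for AUs, and mean squared error or the Concordance Correlation Coefficient (CCC) for VA, summed to form the multi-task objective.
Apply the ArcFace loss (additive angular margin) to expression recognition, yielding the MT-ArcRes and MT-ArcVGG networks.
Pre-train the networks on Aff-Wild2 and evaluate them on 10 public databases to assess cross-database generalization.
Provide two ArcFace-based networks, trained on Aff-Wild2 and re-trained on multiple expression databases, that improve on the state of the art.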
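
The summed multi-task objective described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the equal (unweighted) sum, the use of 1 - CCC as the VA term, the averaging of the valence and arousal CCCs, and every function and parameter name below are hypothetical choices made for the example.

```python
import math


def ccc(pred, target):
    """Concordance Correlation Coefficient of two equal-length sequences.

    Undefined (division by zero) if both sequences are constant and equal.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; `label` is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over independent AU detections."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def multitask_loss(expr_logits, expr_labels, au_probs, au_labels,
                   valence, valence_true, arousal, arousal_true):
    """Unweighted sum of the three task losses over a batch (assumption)."""
    n = len(expr_labels)
    expr = sum(cross_entropy(z, y) for z, y in zip(expr_logits, expr_labels)) / n
    au = sum(binary_cross_entropy(p, y) for p, y in zip(au_probs, au_labels)) / n
    # CCC is a similarity in [-1, 1], so 1 - CCC is used as the VA loss here.
    va = 1 - 0.5 * (ccc(valence, valence_true) + ccc(arousal, arousal_true))
    return expr + au + va
```

Because CCC is computed over a whole sequence of predictions, the VA term is evaluated once per batch, whereas the expression and AU terms are averaged per sample.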

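Experimental Results

Research Questions

RQ1: Can Aff-Wild2 support joint recognition of VA, AUs, and expressions in the wild?
RQ2: Do multi-task CNN and CNN-RNN architectures trained on Aff-Wild2 generalize well to other emotion databases?
RQ3: Does audiovisual fusion help the VA, AU, and expression tasks in in-the-wild settings?
RQ4: Does transferring the ArcFace loss from face recognition to emotion tasks improve expression recognition performance?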

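Key Findings

Aff-Wild2 is the first large-scale in-the-wild audiovisual dataset annotated for VA, AUs, and basic expressions, enabling joint analysis of all three.
The MT-VGG and MT-VGG-RNN architectures trained on Aff-Wild2 achieve state-of-the-art performance on the VA and Expr tasks across 10 external emotion databases, and audiovisual fusion brings further gains.
The ArcFace-based networks (MT-ArcRes, MT-ArcVGG), trained on Aff-Wild2 and re-trained on various expression databases, outperform competing methods and set new state-of-the-art results on several datasets.
Cross-database evaluation on static and video databases indicates that Aff-Wild2 is a rich pre-training resource for robust emotion recognition models.
The ArcFace loss proves effective in the emotion recognition setting, suggesting that angular-margin methods are also valuable for affective tasks beyond face recognition.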
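
To make the angular-margin idea concrete, here is a rough sketch of how ArcFace-style logits are formed for a single sample. It follows the general ArcFace formulation (L2-normalized features and class weights, a margin added to the target-class angle, then a scale factor); the values scale=64 and margin=0.5 are the defaults commonly quoted for face recognition and are only assumptions here, since this summary does not state the settings used for the expression networks. All names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(feature, class_weights, label, scale=64.0, margin=0.5):
    """Return ArcFace-style logits for one sample.

    feature:       embedding vector of the sample
    class_weights: one weight vector per expression class
    label:         index of the ground-truth class
    scale, margin: assumed hyperparameters, not taken from the paper
    """
    f = l2_normalize(feature)
    logits = []
    for idx, w in enumerate(class_weights):
        # Cosine of the angle between the feature and the class center.
        cos_theta = sum(a * b for a, b in zip(f, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos domain
        if idx == label:
            # Additive angular margin: penalize only the target class.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits
```

The resulting logits would then be passed to an ordinary softmax cross-entropy. Since the margin lowers the target-class logit, the network has to pull same-class embeddings closer together in angle to reach the same loss. This sketch omits the special handling that practical implementations add for the case where the angle plus the margin exceeds pi.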

This review was generated by AI and reviewed by human editors.