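[Paper Summary] Affect Analysis in-the-wild: Valence-Arousal, Expressions, Action Units and a Unified Framework
This paper proposes AffWildNet and a unified multi-task framework for affect analysis in-the-wild, covering valence/arousal, expressions, and action units. It uses the Aff-Wild and Aff-Wild2 databases and reports a broad set of multi-task and multi-component architectures together with benchmark results.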
Affect recognition based on subjects' facial expressions has been a topic of major research in the attempt to generate machines that can understand the way subjects feel, act and react. In the past, due to the unavailability of large amounts of data captured in real-life situations, research has mainly focused on controlled environments. However, recently, social media and platforms have been widely used. Moreover, deep learning has emerged as a means to solve visual analysis and recognition problems. This paper exploits these advances and presents significant contributions for affect analysis and recognition in-the-wild. Affect analysis and recognition can be seen as a dual knowledge generation problem, involving: i) creation of new, large and rich in-the-wild databases and ii) design and training of novel deep neural architectures that are able to analyse affect over these databases and to successfully generalise their performance on other datasets. The paper focuses on large in-the-wild databases, i.e., Aff-Wild and Aff-Wild2 and presents the design of two classes of deep neural networks trained with these databases. The first class refers to uni-task affect recognition, focusing on prediction of the valence and arousal dimensional variables. The second class refers to estimation of all main behavior tasks, i.e. valence-arousal prediction; categorical emotion classification in seven basic facial expressions; facial Action Unit detection. A novel multi-task and holistic framework is presented which is able to jointly learn and effectively generalize and perform affect recognition over all existing in-the-wild databases. Large experimental studies illustrate the achieved performance improvement over the existing state-of-the-art in affect recognition.
Research Motivation and Objectives
- Motivate robust affect recognition in unconstrained real-world settings using large-scale in-the-wild datasets.
- Develop deep learning architectures that can jointly model dimensional (valence/arousal), categorical expressions, and action units.
- Create and leverage large in-the-wild databases (Aff-Wild, Aff-Wild2) to train and generalize affect recognition systems.
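Proposed Method
- Proposes uni-task (dimensional) and multi-task holistic architectures that combine CNN features with temporal models (RNNs/GRUs).
- Introduces AffWildNet, an end-to-end CNN-RNN network for valence/arousal estimation on Aff-Wild, trained with a CCC-based loss (L_total = 1 - 0.5*(rho_a + rho_v)), where rho_a and rho_v are the Concordance Correlation Coefficients for arousal and valence.
- Improves the architecture through multi-component CNN plus multi-RNN designs (CNN-3RNN and CNN-1RNN), which exploit low-, mid-, and high-level CNN features in separate RNNs and fuse them.
- Augments the CNN features by concatenating them with features from 68 facial landmarks, to improve temporal modeling.
- Explores model-level and decision-level ensemble fusion to improve valence/arousal prediction, combined with post-processing (median filtering, smoothing).
- Pre-trains the architectures on Aff-Wild2 and adapts them to the utterance-level annotations of the OMG-Emotion dataset.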
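The CCC-based loss mentioned above can be illustrated with a minimal NumPy sketch. It uses the standard definition of the Concordance Correlation Coefficient; the function names `ccc` and `ccc_loss` are hypothetical, and the paper's actual training code (framework, batching, numerical safeguards) is not shown in this summary, so treat this as an illustration rather than the authors' implementation.

```python
import numpy as np


def ccc(pred, target):
    """Concordance Correlation Coefficient of two 1-D sequences.

    Standard definition: 2*cov / (var_pred + var_target + (mean_pred - mean_target)^2).
    Assumes at least one sequence is non-constant (otherwise the denominator can be 0).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    # Population (biased) variance and covariance, used consistently.
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def ccc_loss(pred_v, true_v, pred_a, true_a):
    """L_total = 1 - 0.5 * (rho_a + rho_v), as quoted in the summary."""
    rho_v = ccc(pred_v, true_v)  # valence CCC
    rho_a = ccc(pred_a, true_a)  # arousal CCC
    return 1.0 - 0.5 * (rho_a + rho_v)


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    true_v = [0.1, -0.3, 0.5, 0.2]
    true_a = [0.4, 0.0, -0.2, 0.3]
    pred_v = [0.2, -0.2, 0.4, 0.1]
    pred_a = [0.3, 0.1, -0.1, 0.2]
    print(ccc_loss(pred_v, true_v, pred_a, true_a))
```

Because CCC penalizes both poor correlation and a shift in mean or scale, a prediction that tracks the annotation's shape but sits at a constant offset still incurs a loss, which is not true of a pure Pearson-correlation objective.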
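Experimental Results
Research Questions
- RQ1: How can large-scale in-the-wild affect datasets (Aff-Wild, Aff-Wild2) be used to improve valence-arousal estimation, expression classification, and action unit detection?
- RQ2: Can a unified multi-task framework that jointly learns dimensional, categorical, and AU representations outperform uni-task models in in-the-wild settings?
- RQ3: Do multi-component CNN+RNN architectures that fuse multi-level CNN features with facial landmarks yield better temporal affect estimation?
- RQ4: How do model-level and decision-level fusion affect the accuracy of in-the-wild valence/arousal prediction?
- RQ5: Do multi-component networks pre-trained on Aff-Wild2 and trained end-to-end significantly improve performance on in-the-wild datasets, and do they transfer to related tasks (e.g., OMG-Emotion)?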
Key Findings
| Model | Valence CCC | Arousal CCC | Mean CCC | Valence MSE | Arousal MSE | Mean MSE |
|---|---|---|---|---|---|---|
| FATAUVA-Net | 0.40 | 0.28 | 0.34 | 0.12 | 0.10 | 0.11 |
| VGG-16 | 0.40 | 0.30 | 0.35 | 0.13 | 0.11 | 0.12 |
| ResNet-50 | 0.43 | 0.30 | 0.37 | 0.11 | 0.11 | 0.11 |
| VGG-FACE | 0.51 | 0.33 | 0.42 | 0.10 | 0.08 | 0.09 |
| VGG-FACE-LSTM | 0.52 | 0.38 | 0.45 | 0.10 | 0.09 | 0.10 |
| AffWildNet | 0.57 | 0.43 | 0.50 | 0.08 | 0.06 | 0.07 |
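- AffWildNet achieves the best valence/arousal CCC scores among the evaluated architectures (mean CCC 0.50, versus 0.34 for FATAUVA-Net and 0.45 for the strongest baseline, VGG-FACE-LSTM).
- Multi-component CNN+RNN architectures (CNN-3RNN, CNN-1RNN), which exploit multi-level CNN features in separate RNNs and fuse them, improve dimensional affect estimation compared with single-RNN approaches.
- Model-level fusion with an RNN-based fusion module outperforms decision-level fusion and FC-based fusion on valence/arousal.
- Pre-training on Aff-Wild2 and training the multi-component networks end-to-end significantly improves performance on in-the-wild datasets and transfers to a related task (OMG-Emotion).
- Aff-Wild2 provides comprehensive annotations for valence/arousal, AUs, and basic expressions, covering 558 videos and 458 subjects, which makes in-the-wild learning more robust.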
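The decision-level fusion and median-filter post-processing mentioned in this summary can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how the paper combines model outputs at decision level or which filter length it uses, so simple averaging, the kernel size of 5, the edge padding, and the function names `decision_level_fusion` and `median_filter` are all hypothetical choices.

```python
import numpy as np


def decision_level_fusion(predictions):
    """Average per-frame predictions from several models.

    predictions: array-like of shape (n_models, n_frames).
    Plain averaging is an assumption; the paper's fusion rule may differ.
    """
    return np.mean(np.asarray(predictions, dtype=float), axis=0)


def median_filter(signal, kernel_size=5):
    """Sliding-window median over a 1-D sequence, with edge padding.

    kernel_size is a hypothetical default and must be odd.
    Requires NumPy >= 1.20 for sliding_window_view.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    signal = np.asarray(signal, dtype=float)
    half = kernel_size // 2
    # Repeat the first and last values so the output keeps the input length.
    padded = np.pad(signal, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    return np.median(windows, axis=1)


if __name__ == "__main__":
    # Made-up per-frame valence predictions from two models.
    model_a = [0.10, 0.12, 0.90, 0.14, 0.15]
    model_b = [0.20, 0.18, 0.80, 0.16, 0.15]
    fused = decision_level_fusion([model_a, model_b])
    print(median_filter(fused, kernel_size=3))
```

The median filter suppresses isolated single-frame spikes while preserving slower changes, which is the usual motivation for applying it to frame-level affect predictions.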
This summary was generated by AI and reviewed by a human editor.