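[Paper Summary] Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition
The paper extends the Aff-Wild dataset into Aff-Wild2, which contains 458 subjects and about 2.8M frames, and proposes a CNN–RNN–attention architecture for continuous valence-arousal prediction that performs well in cross-database transfer to RECOLA.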
Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data Aff-Wild2. The videos are downloaded from YouTube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific and cross-database experiments are performed in this paper, using Aff-Wild2 along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with an attention mechanism, thus exploiting the invariant properties of convolutional features while modeling the temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show promise for the use of the extended Aff-Wild, as well as of the developed deep neural architectures, for visual analysis of human behaviour in terms of continuous emotion dimensions.
Research Motivation and Objectives
- Extend the Aff-Wild database by increasing variability and size to Aff-Wild2 (more subjects, more frames, diverse conditions).
- Develop end-to-end deep architectures (CNN–RNN with attention) for continuous valence-arousal estimation in the wild.
- Evaluate cross-database generalization by fine-tuning on RECOLA and comparing to state-of-the-art models.
- Analyze how pre-training on large face datasets affects affect prediction performance.
Proposed Method
- Construct Aff-Wild2 by adding 260 videos (1,413,000 frames) to Aff-Wild, totaling 558 videos and 2,786,201 frames across 458 subjects.
- Have four experts annotate valence and arousal continuously over time, then post-process the annotations to obtain MAIC-based final labels.
- Detect faces in frames and normalize to 96×96×3 inputs for CNNs.
- Experiment with CNN backbones (VGGFACE, VGGFACE2, DenseNet-121, each pre-trained on its corresponding dataset) and RNN variants (LSTM, GRU, indRNN), each with 2 hidden RNN layers of 128 units.
- Incorporate an attention layer on top of the RNNs and train with a loss L_total = 1 - (ρ_a + ρ_v)/2, where ρ_a and ρ_v are CCCs for arousal and valence.
- Evaluate architectures on Aff-Wild2 using the Concordance Correlation Coefficient (CCC) as the performance metric. Reported training details include the Adam optimizer, a batch size of 320, and an attention length of 32.
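The CCC-based objective above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the standard definition of the Concordance Correlation Coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (biased) variances. The function names `ccc` and `total_loss` are hypothetical. A real training loop would compute the same quantity on tensors in a deep learning framework so that gradients can flow through it.

```python
from statistics import fmean


def ccc(predictions, labels):
    """Concordance Correlation Coefficient of two equal-length sequences."""
    if len(predictions) != len(labels) or len(predictions) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = fmean(predictions)
    mean_y = fmean(labels)
    # Population (biased) variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in predictions])
    var_y = fmean([(y - mean_y) ** 2 for y in labels])
    cov = fmean([(p - mean_p) * (y - mean_y) for p, y in zip(predictions, labels)])
    denom = var_p + var_y + (mean_p - mean_y) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2.0 * cov / denom


def total_loss(arousal_pred, arousal_true, valence_pred, valence_true):
    """L_total = 1 - (rho_a + rho_v) / 2, with rho_a and rho_v the two CCCs."""
    rho_a = ccc(arousal_pred, arousal_true)
    rho_v = ccc(valence_pred, valence_true)
    return 1.0 - (rho_a + rho_v) / 2.0
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels: `ccc([1, 2, 3], [2, 3, 4])` is 4/7 rather than 1, even though the two sequences are perfectly correlated. A loss built on CCC therefore pushes the model toward the right scale and mean as well as the right trend.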
Experimental Results
Research Questions
- RQ1: Can Aff-Wild2 improve robustness and coverage of spontaneous affective expressions in-the-wild compared to Aff-Wild?
- RQ2: What CNN–RNN–attention configurations yield the best valence-arousal predictions on Aff-Wild2?
- RQ3: Do models trained on Aff-Wild2 generalize to other datasets (e.g., RECOLA) after fine-tuning?
- RQ4: How does pre-training on large face datasets (VGGFACE/VGGFACE2) affect affect prediction performance?
Key Findings
- Aff-Wild2 comprises 558 videos, 2,786,201 frames, and 458 subjects (279 male, 179 female).
- The best-performing architecture is VGGFACE-GRU-attention, which reaches CCCs of 0.55 (valence) and 0.45 (arousal) on the test set, and 0.58 and 0.48 respectively on the validation set.
- Fine-tuning the best Aff-Wild2 model (VGGFACE-GRU-attention) on RECOLA yields CCCs of 0.547 (valence) and 0.304 (arousal), outperforming the ResNet-GRU and AffWildNet baselines on RECOLA.
- Attention-enhanced CNN–RNN models consistently improve CCC over non-attention variants across configurations.
- Cross-database transfer shows strong improvements when models pre-trained on Aff-Wild2 are adapted to RECOLA, indicating good generalization of the proposed approach.
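As a worked illustration (these values are not reported in the source), plugging the CCCs listed above into the training objective L_total = 1 - (ρ_a + ρ_v)/2 shows what the reported scores would correspond to in loss terms. This assumes the loss is evaluated over the whole set at once, which need not match the per-batch values seen during training.

```latex
\begin{align*}
\text{Aff-Wild2 validation:} \quad & L_{\text{total}} = 1 - \frac{0.48 + 0.58}{2} = 0.47 \\
\text{Aff-Wild2 test:} \quad & L_{\text{total}} = 1 - \frac{0.45 + 0.55}{2} = 0.50 \\
\text{RECOLA (fine-tuned):} \quad & L_{\text{total}} = 1 - \frac{0.304 + 0.547}{2} = 0.5745
\end{align*}
```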
This summary was generated by AI and reviewed by human editors.