[Paper Review] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach
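This paper proposes a four-modality (scene, face, audio, text) fusion approach with prototype-augmented classification for video-level ambivalence/hesitancy recognition. Its best single fusion model reaches an average MF1 of 83.25%, and an ensemble of five prototype-augmented fusion models reaches 71.43% MF1 on the final test set.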
Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
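Research Motivation and Objectives
- Motivate and address the recognition of ambivalence/hesitancy, a subtle multimodal behavioral state, in unconstrained videos.
- Develop a four-modality pipeline (scene, face, audio, text) that learns compact unimodal embeddings for fusion.
- Explore Transformer-based fusion and prototype-augmented objectives for modeling cross-modal dependencies.
- Show that multimodal fusion outperforms the unimodal baselines on the BAH corpus, and demonstrate robustness through ensembling.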
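Proposed Method
- Scene dynamics are extracted with a VideoMAE-based visual model.
- Frame-level facial emotion embeddings are obtained from an EfficientNetB0 fine-tuned on AffectNet, aggregated by statistical pooling, and fed to an MLP.
- Acoustic emotion features are extracted with EmotionWav2Vec2.0, modeled over time by a Mamba or Transformer encoder, and then pooled.
- Linguistic cues are modeled by fine-tuning transformer-based text models (e.g., EmotionDistilRoBERTa, EmotionTextClassifier) on the transcripts to obtain dense text embeddings.
- The unimodal embeddings are fused by a Transformer-based multimodal module that uses modality tokens and a prototype-based classification objective, with modality masking to handle missing data.
- The system is two-stage: the per-modality unimodal encoders come first, followed by fusion in a shared latent space, with prototype and diversity-regularization losses added where needed.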
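The pooling, masking, and prototype steps listed above can be illustrated with a small standard-library sketch. It is not the authors' implementation: the choice of mean and standard deviation as the pooled statistics, the masked averaging (a deliberately simple stand-in for the Transformer-based fusion module), cosine similarity as the prototype score, and all function names, vectors, and labels are assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def statistical_pooling(frames):
    """Collapse per-frame embeddings (T x D) into per-dimension mean and std (length 2D)."""
    dims = list(zip(*frames))  # transpose: D tuples, each of length T
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def masked_mean(embeddings, mask):
    """Average only the modalities flagged as present.

    Assumption: this replaces the Transformer fusion purely to keep the sketch short.
    """
    present = [e for e, m in zip(embeddings, mask) if m]
    if not present:
        raise ValueError("at least one modality must be present")
    return [fmean(col) for col in zip(*present)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prototype_predict(fused, prototypes):
    """Return the label whose prototype is most similar to the fused vector, plus all scores."""
    scores = {label: cosine(fused, proto) for label, proto in prototypes.items()}
    return max(scores, key=scores.get), scores


# Hypothetical face frames: two frames, two embedding dimensions.
face_vec = statistical_pooling([[0.0, 2.0], [2.0, 4.0]])

# Hypothetical unimodal embeddings already projected to a shared 2-d space.
scene, face, audio, text = [1.0, 0.0], [0.8, 0.2], [0.0, 0.0], [0.6, 0.4]
mask = [1, 1, 0, 1]  # audio is missing for this video
fused = masked_mean([scene, face, audio, text], mask)

# Hypothetical class prototypes; a trained model would learn these.
prototypes = {"A/H": [1.0, 0.0], "no A/H": [0.0, 1.0]}
label, scores = prototype_predict(fused, prototypes)
print(face_vec, label)
```

In a learned model the prototypes and the projections into the shared space would be trained jointly; here they are fixed only so that the masking and nearest-prototype logic can be read in isolation.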
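Experimental Results
Research Questions
- RQ1: Can robust video-level ambivalence/hesitancy recognition be achieved by exploiting complementary signals from scene, face, audio, and text?
- RQ2: Does prototype-augmented fusion improve discriminative power and generalization over standard fusion?
- RQ3: How much does each modality contribute to final performance, and how does multimodal fusion compare with the unimodal baselines?
- RQ4: Is the ensemble fusion robust on the undisclosed private test data of the ABAW10 A/H challenge?
Key Findings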
| Configuration | ID | Modality / Input models | Features / Fusion model | Classifier | BAH Devel. / Valid. (MF1, %) | BAH Test (MF1, %) | Average (MF1, %) | Final test (MF1, %) |
|---|---|---|---|---|---|---|---|---|
| Face | 1 | Face | EmotionEfficientNetB0 + Statistical Features | MLP | 65.29 | 60.05 | 62.67 | – |
| Scene | 2 | Scene | VideoMAE | Linear Layer | 61.71 | 62.21 | 61.96 | – |
| Audio | 3 | Audio | EmotionWav2Vec2.0 + Mamba | Linear Layer | 67.20 | 70.87 | 69.03 | – |
| Text | 4 | Text | TF-IDF | Logistic Regression | 68.30 | 67.75 | 68.03 | – |
| Text | 5 | Text | TF-IDF | CatBoost | 65.56 | 72.02 | 68.79 | – |
| Text | 6 | Text | Fine-tuned EmotionTextClassifier | MLP | 69.28 | 70.72 | 70.00 | – |
| Text | 7 | Text | Fine-tuned EmotionDistilRoBERTa | MLP | 68.54 | 71.49 | 70.02 | – |
| Multimodal | 8 | Models IDs 1, 2, 3 and 4 | Multimodal Fusion Model | Linear Layer | 80.79 | 77.03 | 78.91 | – |
| Multimodal | 9 | Models IDs 1, 2, 3 and 5 | Multimodal Fusion Model | Linear Layer | 77.91 | 78.54 | 78.22 | – |
| Multimodal | 10 | Models IDs 1, 2, 3 and 6 | Multimodal Fusion Model | Linear Layer | 78.35 | 77.03 | 77.69 | – |
| Multimodal | 11 | Models IDs 1, 2, 3 and 7 | Multimodal Fusion Model | Linear Layer | 85.38 | 79.94 | 82.66 | 68.32 |
| Multimodal | 12 | Models IDs 1, 2, 3 and 7 | Multimodal Fusion Model with Prototype Head | Linear Layer | 83.79 | 82.72 | 83.25 | 65.21 |
| Multimodal | 13 | Models IDs 1, 2, 3 and 7 | Ensemble of Five Multimodal Fusion Models | Linear Layer | 81.94 | 80.64 | 81.29 | 70.17 |
| Multimodal | 14 | Models IDs 1, 2, 3 and 7 | Ensemble of Five Multimodal Fusion Models with Prototype Head | Linear Layer | 83.00 | 80.77 | 81.89 | 71.43 |
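As a quick sanity check on the table, the Average column appears to be the arithmetic mean of the Devel. / Valid. and Test columns. The sketch below recomputes it from the values copied out of the table; the interpretation as a simple mean is an assumption inferred from the numbers, and the helper names are hypothetical.

```python
# (ID, devel/valid MF1, test MF1, reported average), copied from the table above.
ROWS = [
    (1, 65.29, 60.05, 62.67),
    (2, 61.71, 62.21, 61.96),
    (3, 67.20, 70.87, 69.03),
    (4, 68.30, 67.75, 68.03),
    (5, 65.56, 72.02, 68.79),
    (6, 69.28, 70.72, 70.00),
    (7, 68.54, 71.49, 70.02),
    (8, 80.79, 77.03, 78.91),
    (9, 77.91, 78.54, 78.22),
    (10, 78.35, 77.03, 77.69),
    (11, 85.38, 79.94, 82.66),
    (12, 83.79, 82.72, 83.25),
    (13, 81.94, 80.64, 81.29),
    (14, 83.00, 80.77, 81.89),
]


def mean_of_splits(devel, test):
    """Arithmetic mean of the two split scores (assumed definition of Average)."""
    return (devel + test) / 2


def max_deviation(rows):
    """Largest absolute gap between the recomputed and the reported average."""
    return max(abs(mean_of_splits(d, t) - avg) for _, d, t, avg in rows)


deviation = max_deviation(ROWS)
print(f"largest gap between recomputed and reported average: {deviation:.3f}")
```

Gaps of about half a hundredth of a point are expected, since the reported averages were presumably computed from unrounded scores.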
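- Multimodal fusion outperforms all unimodal baselines on both the development and test splits.
- Best unimodal average MF1: 70.02% (EmotionDistilRoBERTa); best fusion average MF1: 83.25% (prototype-augmented four-modality model).
- The highest final test MF1, 71.43%, is reached by the ensemble of five prototype-augmented fusion models.
- Ablation experiments indicate that the combination of scene and text contributes most, and that using all four modalities together yields the best overall result.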
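- Prototype-augmented fusion provides an auxiliary signal that improves the final prediction, and ensembling helps generalization on the private test set. On the final test set the prototype gain shows up only with ensembling: the single prototype model scores 65.21% against 68.32% for the plain fusion model, whereas the prototype ensemble reaches 71.43% against 70.17% for the plain ensemble.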
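To make the ensemble and metric concrete, here is a minimal sketch of soft voting followed by macro-averaged F1. It assumes that MF1 denotes macro F1 over the two classes, that the ensemble averages positive-class probabilities, and that a 0.5 threshold is applied; the summary does not state how the five models are actually combined, and the probabilities and labels below are made up.

```python
def soft_vote(prob_lists):
    """Average positive-class probabilities across models, sample by sample."""
    n_models = len(prob_lists)
    return [sum(col) / n_models for col in zip(*prob_lists)]


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed from true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical probabilities from three models for four videos.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.7, 0.6, 0.1, 0.3],
    [0.8, 0.7, 0.3, 0.3],
]
y_true = [1, 1, 0, 1]  # hypothetical labels: 1 = ambivalence/hesitancy present

avg_probs = soft_vote(model_probs)
y_pred = [1 if p >= 0.5 else 0 for p in avg_probs]  # assumed 0.5 threshold
score = macro_f1(y_true, y_pred)
print(y_pred, round(score, 4))
```

Because macro F1 weights both classes equally, a single missed positive in this toy example lowers the score for both the positive class (through recall) and the negative class (through precision).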
This review was generated by AI and reviewed by human editors.