QUICK REVIEW

[Paper Review] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina, Alexandr Axyonov | arXiv (Cornell University) | Mar 13, 2026
Emotion and Mood Recognition | Cited by 0
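One-Sentence Summary

This paper proposes a four-modality (scene, face, audio, text) fusion approach with prototype-augmented classification for video-level ambivalence/hesitancy recognition; its best fusion model reaches an average MF1 of 83.25%, and an ensemble of five prototype-augmented fusion models reaches a final test MF1 of 71.43%.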

ABSTRACT

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.

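Research Motivation and Objectives

  • Motivate and address ambivalence/hesitancy recognition, a subtle multimodal behavioral state, in unconstrained videos.
  • Develop a four-modality pipeline (scene, face, audio, text) that learns compact unimodal embeddings for fusion.
  • Explore Transformer-based fusion and prototype-augmented objectives to model cross-modal dependencies.
  • Show that multimodal fusion outperforms the unimodal baselines on the BAH dataset, and demonstrate robustness through ensembling.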

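Proposed Method

  • Capture scene dynamics with a VideoMAE-based visual model.
  • Obtain frame-level facial emotion embeddings from an EfficientNetB0 fine-tuned on AffectNet, aggregate them by statistical pooling, and feed the result to an MLP.
  • Extract acoustic emotion features with EmotionWav2Vec2.0, model them over time with a Mamba or Transformer encoder, and then pool.
  • Model linguistic cues from text: fine-tune transformer-based text models (e.g., EmotionDistilRoBERTa, EmotionTextClassifier) on the transcripts to obtain dense text embeddings.
  • Fuse the unimodal embeddings with a Transformer-based multimodal module that uses modality tokens and a prototype-based classification objective, with modality masking to handle missing data.
  • Train the system in two stages: first the unimodal encoders for each modality, then fusion in a shared latent space, optionally with prototype and diversity-regularization losses.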
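
The pooling, masking, and prototype steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the mean-plus-standard-deviation pooling, the masked averaging (a stand-in for the Transformer fusion module), the cosine-similarity prototype scoring, the class names, and all numbers are assumptions chosen for illustration.

```python
import math
from statistics import fmean, pstdev


def statistical_pooling(frames):
    """Concatenate per-dimension mean and population std over frame embeddings."""
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def masked_mean_fusion(embeddings, mask):
    """Average the embeddings of the modalities that are present (mask == 1).

    Simplified stand-in for a learned fusion module: it only shows how a
    modality mask keeps missing inputs out of the fused vector.
    """
    present = [e for e, m in zip(embeddings, mask) if m]
    if not present:
        raise ValueError("at least one modality must be present")
    return [fmean(vals) for vals in zip(*present)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prototype_predict(fused, prototypes):
    """Pick the class whose best-matching prototype is most similar to `fused`."""
    scores = {
        label: max(cosine(fused, p) for p in protos)
        for label, protos in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical frame-level face embeddings (3 frames, 2 dimensions).
frames = [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]]
face_stats = statistical_pooling(frames)  # [mean_0, mean_1, std_0, std_1]

# Hypothetical 2-d unimodal embeddings in the order scene, face, audio, text.
# The audio embedding is marked as missing and must not affect the result.
unimodal = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
mask = [1, 1, 0, 1]
fused = masked_mean_fusion(unimodal, mask)

# Hypothetical class prototypes (one per class here; several are allowed).
prototypes = {"A/H": [[1.0, 1.0]], "no A/H": [[1.0, -1.0]]}
label, scores = prototype_predict(fused, prototypes)
```

In a trained system the prototypes would be learned jointly with the fusion model; here they are fixed by hand so the example stays self-contained.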

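Experimental Results

Research Questions

  • RQ1: Can robust video-level ambivalence/hesitancy recognition be achieved by exploiting complementary signals from scene, face, audio, and text?
  • RQ2: Does prototype-augmented fusion improve discriminative power and generalization over standard fusion?
  • RQ3: How much does each modality contribute to final performance, and how does multimodal fusion compare with the unimodal baselines?
  • RQ4: Is the performance of ensemble fusion robust on the undisclosed private test data of the ABAW10 A/H challenge?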

Key Findings

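Model configurations and MF1 (%) on the BAH subcorpora (Devel./Valid., Test, and their average) and on the final test set. A "-" marks a value that is not reported.

|  | ID | Modality | Features | Classifier | Devel./Valid. MF1 (%) | Test MF1 (%) | Average MF1 (%) | Final test MF1 (%) |
|---|---|---|---|---|---|---|---|---|
| Face | 1 | Face | EmotionEfficientNetB0 + Statistical Features | MLP | 65.29 | 60.05 | 62.67 | - |
| Scene | 2 | Scene | VideoMAE | Linear Layer | 61.71 | 62.21 | 61.96 | - |
| Audio | 3 | Audio | EmotionWav2Vec2.0 + Mamba | Linear Layer | 67.20 | 70.87 | 69.03 | - |
| Text | 4 | Text | TF-IDF | Logistic Regression | 68.30 | 67.75 | 68.03 | - |
| Text | 5 | Text | TF-IDF | CatBoost | 65.56 | 72.02 | 68.79 | - |
| Text | 6 | Text | Fine-tuned EmotionTextClassifier | MLP | 69.28 | 70.72 | 70.00 | - |
| Text | 7 | Text | Fine-tuned EmotionDistilRoBERTa | MLP | 68.54 | 71.49 | 70.02 | - |
| Multimodal | 8 | Model IDs 1, 2, 3 and 4 | Multimodal Fusion Model | Linear Layer | 80.79 | 77.03 | 78.91 | - |
| Multimodal | 9 | Model IDs 1, 2, 3 and 5 | Multimodal Fusion Model | Linear Layer | 77.91 | 78.54 | 78.22 | - |
| Multimodal | 10 | Model IDs 1, 2, 3 and 6 | Multimodal Fusion Model | Linear Layer | 78.35 | 77.03 | 77.69 | - |
| Multimodal | 11 | Model IDs 1, 2, 3 and 7 | Multimodal Fusion Model | Linear Layer | 85.38 | 79.94 | 82.66 | 68.32 |
| Multimodal | 12 | Model IDs 1, 2, 3 and 7 | Multimodal Fusion Model with Prototype Head | Linear Layer | 83.79 | 82.72 | 83.25 | 65.21 |
| Multimodal | 13 | Model IDs 1, 2, 3 and 7 | Ensemble of Five Multimodal Fusion Models | Linear Layer | 81.94 | 80.64 | 81.29 | 70.17 |
| Multimodal | 14 | Model IDs 1, 2, 3 and 7 | Ensemble of Five Multimodal Fusion Models with Prototype Head | Linear Layer | 83.00 | 80.77 | 81.89 | 71.43 |
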
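  • Multimodal fusion outperforms all unimodal baselines on both the development and test subsets.
  • Best unimodal average MF1: 70.02% (EmotionDistilRoBERTa); best fusion average MF1: 83.25% (prototype-augmented four-modality model).
  • The highest final test MF1, 71.43%, is achieved by an ensemble of five prototype-augmented fusion models.
  • Ablation experiments indicate that the combination of scene and text contributes the most, and that using all four modalities together gives the best overall result.
  • Prototype-augmented fusion provides an auxiliary signal that improves the final prediction, and ensembling helps generalization on the private test set: final test MF1 rises from 65.21% to 71.43% with the prototype head and from 68.32% to 70.17% without it.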
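
To make the reported numbers concrete, the sketch below shows one way the metric and the ensemble could be computed. It assumes that MF1 denotes macro-averaged F1 and that the ensemble averages the members' positive-class probabilities before thresholding; the summary confirms neither detail, and the probabilities and labels are made up. Only the last line uses figures from the results (model ID 11: 85.38 and 79.94).

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores (assumed meaning of MF1)."""
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def ensemble_predict(member_probs, threshold=0.5):
    """Average each video's positive-class probability over models, then threshold."""
    n_models = len(member_probs)
    averaged = [sum(col) / n_models for col in zip(*member_probs)]
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical ground truth for four videos (1 = A/H present, 0 = absent).
y_true = [1, 0, 1, 0]

# Hypothetical positive-class probabilities from five fusion models.
member_probs = [
    [0.9, 0.2, 0.4, 0.6],
    [0.8, 0.1, 0.6, 0.4],
    [0.7, 0.3, 0.7, 0.3],
    [0.6, 0.4, 0.3, 0.7],
    [0.9, 0.2, 0.8, 0.2],
]

single_pred = ensemble_predict(member_probs[:1])  # first model on its own
ensemble_pred = ensemble_predict(member_probs)    # all five models averaged

single_mf1 = macro_f1(y_true, single_pred)
ensemble_mf1 = macro_f1(y_true, ensemble_pred)

# The "Average" column is the mean of the Devel./Valid. and Test scores,
# shown here for model ID 11.
average_mf1_id11 = (85.38 + 79.94) / 2
```

In this toy setup the first model misclassifies two videos on its own, while the averaged probabilities recover all four labels, which is the kind of variance reduction an ensemble is meant to provide.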


This review was generated by AI and reviewed by human editors.