QUICK REVIEW

[Paper Review] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina, Alexandr Axyonov|arXiv (Cornell University)|Mar 13, 2026

Emotion and Mood Recognition | Cited by 0

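One-Sentence Summary

This paper proposes a four-modality (scene, face, audio, text) fusion approach with prototype-augmented classification for video-level ambivalence/hesitancy recognition, reaching an average MF1 of 83.25% with its best single fusion model and a final test MF1 of 71.43% with an ensemble of five prototype-augmented fusion models.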

ABSTRACT

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.

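Motivation and Objectives

Motivate and address ambivalence/hesitancy recognition, a subtle, multimodal behavioral state, in unconstrained videos.
Develop a four-modality pipeline (scene, face, audio, text) that learns compact unimodal embeddings for fusion.
Explore Transformer-based fusion and prototype-augmented objectives to model cross-modal dependencies.
Show that multimodal fusion outperforms unimodal baselines on the BAH dataset, and demonstrate robustness through ensembling.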

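Proposed Method

Extract scene dynamics with a VideoMAE-based visual model.
Obtain frame-level facial emotion embeddings from an EfficientNetB0 fine-tuned on AffectNet, aggregate them by statistical pooling, and feed the result to an MLP.
Extract acoustic emotion features with EmotionWav2Vec2.0, model them over time with a Mamba or Transformer encoder, then apply pooling.
Model linguistic cues in text: fine-tune transformer-based text models (e.g., EmotionDistilRoBERTa, EmotionTextClassifier) on the transcripts to obtain dense text embeddings.
Fuse the unimodal embeddings with a Transformer-based multimodal module that uses modality tokens and a prototype-based classification objective, with modality masking to handle missing data.
Train a two-stage system: first the per-modality unimodal encoders, then fusion in a shared latent space, optionally with prototype and diversity-regularization losses.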
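
The steps above can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation: the function names, the temperature value, and the toy vectors are hypothetical, and a simple masked mean stands in for the Transformer fusion module, which is not reproduced here. The sketch assumes every modality embedding has already been projected to the same dimension.

```python
import math
from statistics import fmean, pstdev


def statistical_pooling(frames):
    """Collapse frame-level embeddings into one video-level vector by
    concatenating the per-dimension mean and (population) standard deviation."""
    dims = list(zip(*frames))
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def masked_mean_fusion(embeddings, mask):
    """Average only the modalities flagged as present (mask value 1).
    Assumption: a stand-in for the Transformer fusion with modality masking."""
    present = [e for e, m in zip(embeddings, mask) if m]
    if not present:
        raise ValueError("at least one modality must be present")
    return [fmean(values) for values in zip(*present)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prototype_probs(z, prototypes, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities between the fused
    embedding z and one prototype vector per class."""
    logits = [cosine(z, p) / temperature for p in prototypes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two face frames with 2-D embeddings (hypothetical values).
face = statistical_pooling([[0.0, 1.0], [2.0, 3.0]])

# Four 4-D modality embeddings; the text modality is marked as missing.
scene, audio, text = [1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 1.0, 1.0], [0.0] * 4
fused = masked_mean_fusion([scene, face, audio, text], mask=[1, 1, 1, 0])

# One prototype per class (index 0: no A/H, index 1: A/H).
print(prototype_probs(fused, [[0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.5]]))
```

In a trained system the prototypes would be learned parameters and the fusion would attend across modality tokens; the sketch only shows how pooling, masking, and prototype similarity fit together.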

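Experimental Results

Research Questions

RQ1: Can robust video-level ambivalence/hesitancy recognition be achieved by exploiting complementary signals from scene, face, audio, and text?
RQ2: Does prototype-augmented fusion improve discriminative power and generalization over standard fusion?
RQ3: How much does each modality contribute to final performance, and how does multimodal fusion compare with unimodal baselines?
RQ4: Is ensemble fusion robust on the undisclosed private test data of the ABAW10 A/H challenge?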

Key Findings

Model type	ID	Modality	Features	Classifier	BAH Devel. / Valid. (MF1, %)	BAH Test (MF1, %)	BAH Average (MF1, %)	BAH Final test (MF1, %)
Face	1	Face	EmotionEfficientNetB0 + Statistical Features	MLP	65.29	60.05	62.67	–
Scene	2	Scene	VideoMAE	Linear Layer	61.71	62.21	61.96	–
Audio	3	Audio	EmotionWav2Vec2.0 + Mamba	Linear Layer	67.20	70.87	69.03	–
Text	4	Text	TF-IDF	Logistic Regression	68.30	67.75	68.03	–
Text	5	Text	TF-IDF	CatBoost	65.56	72.02	68.79	–
Text	6	Text	Fine-tuned EmotionTextClassifier	MLP	69.28	70.72	70.00	–
Text	7	Text	Fine-tuned EmotionDistilRoBERTa	MLP	68.54	71.49	70.02	–
Multimodal	8	Model IDs 1, 2, 3, and 4	Multimodal Fusion Model	Linear Layer	80.79	77.03	78.91	–
Multimodal	9	Model IDs 1, 2, 3, and 5	Multimodal Fusion Model	Linear Layer	77.91	78.54	78.22	–
Multimodal	10	Model IDs 1, 2, 3, and 6	Multimodal Fusion Model	Linear Layer	78.35	77.03	77.69	–
Multimodal	11	Model IDs 1, 2, 3, and 7	Multimodal Fusion Model	Linear Layer	85.38	79.94	82.66	68.32
Multimodal	12	Model IDs 1, 2, 3, and 7	Multimodal Fusion Model with Prototype Head	Linear Layer	83.79	82.72	83.25	65.21
Multimodal	13	Model IDs 1, 2, 3, and 7	Ensemble of Five Multimodal Fusion Models	Linear Layer	81.94	80.64	81.29	70.17
Multimodal	14	Model IDs 1, 2, 3, and 7	Ensemble of Five Multimodal Fusion Models with Prototype Head	Linear Layer	83.00	80.77	81.89	71.43
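
To make the metric columns concrete, here is a small sketch of how a macro-averaged F1 score and the Average column could be computed. It assumes that MF1 denotes macro F1 over two classes (A/H present vs. absent) and that Average is the arithmetic mean of the Devel. / Valid. and Test scores, which is consistent with the table rows to within rounding. The function names and toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def average_mf1(devel, test):
    """Assumed definition of the Average column: mean of the two subsets."""
    return (devel + test) / 2


# Toy labels (hypothetical, not from the BAH corpus): 1 = A/H present.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")

# Model 12 in the table: Devel. 83.79, Test 82.72, reported Average 83.25.
print(f"mean of Devel. and Test = {average_mf1(83.79, 82.72):.3f}")
```

Because macro F1 weights both classes equally, it penalizes a model that ignores the minority class, which matters when one label is much rarer than the other.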

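Multimodal fusion outperforms all unimodal baselines in both the development and test settings.
Best unimodal average MF1: 70.02% (fine-tuned EmotionDistilRoBERTa); best fusion average MF1: 83.25% (prototype-augmented four-modality model).
The peak final test MF1, 71.43%, is achieved by an ensemble of five prototype-augmented fusion models.
Ablation experiments indicate that the combination of scene and text contributes the most, and that using all four modalities together gives the best overall result.
Prototype-augmented fusion provides an auxiliary signal, and ensembling helps generalization on the private test set: as a single model the prototype variant scores lower on the final test than standard fusion (65.21% vs. 68.32%), but within five-model ensembles it gives the best result (71.43% vs. 70.17%).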
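
The review does not say how the five fusion models are combined, so the following sketch assumes plain soft voting: the models' predicted A/H probabilities are averaged per video and thresholded. The function name, the 0.5 threshold, and the probabilities are hypothetical.

```python
def ensemble_predict(member_probs, threshold=0.5):
    """Soft-voting ensemble.

    member_probs holds one list per model, each giving P(A/H) for every video.
    Returns the binary labels and the averaged probabilities.
    """
    if not member_probs:
        raise ValueError("need at least one model")
    n_models = len(member_probs)
    averaged = [sum(per_video) / n_models for per_video in zip(*member_probs)]
    labels = [int(p >= threshold) for p in averaged]
    return labels, averaged


# Five hypothetical models scoring three videos.
probs = [
    [0.90, 0.20, 0.60],
    [0.80, 0.40, 0.40],
    [0.70, 0.10, 0.55],
    [0.60, 0.30, 0.45],
    [0.95, 0.20, 0.70],
]
labels, averaged = ensemble_predict(probs)
print(labels, [round(p, 2) for p in averaged])
```

Averaging smooths out the disagreement between individual models on the third video, which is one plausible reason an ensemble can generalize better to unseen test data than any single member.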

This review was generated by AI and reviewed by a human editor.