QUICK REVIEW

[Paper Review] Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

Ting Wu, Junjie Peng|arXiv (Cornell University)|Mar 3, 2021

Emotion and Mood Recognition | References: 66 | Cited by: 119

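One-Sentence Summary

This paper proposes a new multimodal fusion framework, Bimodal Information-augmented Multi-Head Attention (BIMHA), for video sentiment analysis. The method uses an attention mechanism to model pairwise (bimodal) interactions among the textual, visual, and acoustic modalities. By assigning dynamic attention weights to the acoustic-visual, acoustic-textual, and visual-textual feature pairs and fusing them through a residual structure, BIMHA improves sentiment prediction accuracy on four public datasets and achieves state-of-the-art results on MOSI, MOSEI, and IEMOCAP.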

ABSTRACT

Humans express feelings or emotions via different channels. Take language as an example, it entails different sentiments under different visual-acoustic contexts. To precisely understand human intentions as well as reduce the misunderstandings caused by ambiguity and sarcasm, we should consider multimodal signals including textual, visual and acoustic signals. The crucial challenge is to fuse different modalities of features for sentiment analysis. To effectively fuse the information carried by different modalities and better predict the sentiments, we design a novel multi-head attention based fusion network, which is inspired by the observations that the interactions between any two pair-wise modalities are different and they do not equally contribute to the final sentiment prediction. By assigning the acoustic-visual, acoustic-textual and visual-textual features with reasonable attention and exploiting a residual structure, we attend to attain the significant features. We conduct extensive experiments on four public multimodal datasets including one in Chinese and three in English. The results show that our approach outperforms the existing methods and can explain the contributions of bimodal interaction in multiple modalities.

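Research Motivation and Objectives

Address the problem that different modality pairs contribute unequally to sentiment prediction in multimodal sentiment fusion.
Model intra-modal, inter-modal, and bimodal interactions jointly to obtain richer feature representations.
Dynamically weight bimodal interactions with an extended multi-head attention mechanism to improve sentiment prediction.
Use attention visualization to explain, at the sample level, how the bimodal combinations (AV, AT, VT) contribute to the sentiment decision.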

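Proposed Method

Proposes Bimodal Multi-Head Attention (BMHA), an extension of multi-head attention that takes multimodal features as queries and bimodal features as keys and values.
Before BMHA is applied, tensor fusion generates the bimodal interaction features, enabling dynamic attention weighting.
A residual connection preserves the original inter-modal features while fusing the attention-augmented bimodal features.
Three parallel MHA heads process the acoustic-visual, acoustic-textual, and visual-textual interactions respectively, each learning attention patterns specific to its modality pair.
The weighted bimodal features are fed into a sentiment inference network for the final prediction.
Visualizing the attention scores makes it possible to explain each bimodal pair's contribution to real-time decisions.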
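
The fusion steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `tensor_fusion` and `bimodal_attention`, the feature sizes, the random projection matrices standing in for learned weights, and the use of a single scaled dot-product attention step in place of the paper's multi-head design are all hypothetical choices made for illustration. Appending a constant 1 before the outer product follows a common tensor-fusion convention and is likewise assumed here.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def tensor_fusion(x, y):
    """Flattened outer product of two vectors, each extended with a constant 1.

    The appended 1 keeps the original unimodal terms inside the product
    (an assumed convention, not confirmed by the review).
    """
    return np.outer(np.append(x, 1.0), np.append(y, 1.0)).ravel()


def bimodal_attention(a, v, t, d_model=8, seed=0):
    """Weight the AV, AT and VT features with one attention step plus a residual."""
    rng = np.random.default_rng(seed)

    def project(feat):
        # Random matrix standing in for a learned linear layer (assumption).
        w = rng.standard_normal((feat.size, d_model)) / np.sqrt(feat.size)
        return feat @ w

    names = ["AV", "AT", "VT"]
    # Keys/values: one projected bimodal feature per modality pair, shape (3, d_model).
    bimodal = np.stack([
        project(tensor_fusion(a, v)),
        project(tensor_fusion(a, t)),
        project(tensor_fusion(v, t)),
    ])
    # Query: projection of the concatenated multimodal feature, shape (d_model,).
    query = project(np.concatenate([a, v, t]))
    # Scaled dot-product scores over the three pairs, normalised to sum to 1.
    weights = softmax(bimodal @ query / np.sqrt(d_model))
    attended = weights @ bimodal
    # Residual connection: keep the query features and add the attended ones.
    fused = query + attended
    return fused, dict(zip(names, weights))


# Toy acoustic, visual and textual vectors with made-up sizes.
rng = np.random.default_rng(1)
a, v, t = rng.standard_normal(3), rng.standard_normal(4), rng.standard_normal(5)
fused, weights = bimodal_attention(a, v, t)
print(fused.shape, weights)
```

The returned weights play the role of the per-sample attention scores that the review describes visualizing: in this toy setup, a larger value for `VT` than for `AV` would mean the visual-textual pair contributed more to the fused vector.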

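Experimental Results

Research Questions

RQ1: Do the different bimodal interactions (AV, AT, VT) contribute differently to sentiment prediction across video samples?
RQ2: Can multi-head attention effectively model bimodal interactions while preserving unimodal and inter-modal representations?
RQ3: Does dynamic attention weighting of bimodal feature pairs improve sentiment classification over fixed fusion strategies?
RQ4: To what extent can the proposed model's predictions be explained through attention visualization?
RQ5: How well does the model generalize across diverse multimodal datasets, including low-resource settings such as Chinese video sentiment analysis?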

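Key Findings

On CMU-MOSI, BIMHA reaches state-of-the-art performance, with a test accuracy (Acc-2) of 83.44% and an F1 score of 85.46% on the negative/non-negative classification task.
On MOSEI, BIMHA achieves 83.19% accuracy (Acc-2) and an 83.21% F1 score on the negative/positive sentiment classification task, outperforming prior methods.
On IEMOCAP, BIMHA reaches 86.57% accuracy and an 85.8% F1 score on the 'happy' category, showing strong performance across emotion categories.
Attention visualization shows that VT (visual-textual) features contribute most consistently across the datasets, while AV (acoustic-visual) features dominate in specific samples.
Training with unaligned data further improves performance, reaching 53.87% Acc-2 and 0.765 Corr in the unaligned setting of Self-MM, which indicates robustness to data distribution shift.
Ablation experiments confirm that introducing bimodal attention yields a significant performance gain, with BIMHA2 (unified attention) outperforming BIMHA1 (aligned) on several metrics.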
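
The Acc-2, F1, and Corr figures above are derived from continuous sentiment scores. As a rough sketch of how such metrics are commonly computed, the pure-Python helpers below show the two binary splits mentioned in the findings. The function names are hypothetical, the sample scores are made up, and two conventions are assumed rather than taken from the paper: labels are signed scores where values below zero mean negative sentiment, and F1 is the plain binary F1 on the positive class, which may differ from the averaging used for the reported numbers.

```python
import math


def acc2_and_f1(preds, labels, exclude_zero=False):
    """Binary accuracy and F1 from continuous sentiment scores.

    exclude_zero=False: negative (< 0) vs. non-negative (>= 0).
    exclude_zero=True:  negative (< 0) vs. positive (> 0); samples whose
                        label is exactly 0 are dropped first.
    """
    pairs = [(p, y) for p, y in zip(preds, labels)
             if not (exclude_zero and y == 0)]
    if exclude_zero:
        pred_pos = [p > 0 for p, _ in pairs]
        true_pos = [y > 0 for _, y in pairs]
    else:
        pred_pos = [p >= 0 for p, _ in pairs]
        true_pos = [y >= 0 for _, y in pairs]

    tp = sum(1 for p, y in zip(pred_pos, true_pos) if p and y)
    fp = sum(1 for p, y in zip(pred_pos, true_pos) if p and not y)
    fn = sum(1 for p, y in zip(pred_pos, true_pos) if not p and y)
    correct = sum(1 for p, y in zip(pred_pos, true_pos) if p == y)

    acc = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def pearson_corr(xs, ys):
    """Pearson correlation between predicted and true scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up scores for illustration only.
preds = [1.2, -0.5, 0.3, -2.0, 0.1]
labels = [2.0, -1.0, 0.0, 1.0, -0.4]
print(acc2_and_f1(preds, labels))                     # negative / non-negative
print(acc2_and_f1(preds, labels, exclude_zero=True))  # negative / positive
print(pearson_corr(preds, labels))
```

The two splits can give different numbers on the same predictions, which is why the findings name the task ("negative/non-negative" or "negative/positive") alongside each Acc-2 value.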


This review was generated by AI and reviewed by a human editor.