QUICK REVIEW

[Paper Review] Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Bobo Li, Hao Fei|arXiv (Cornell University)|Aug 8, 2023

Sentiment Analysis and Opinion Mining | Cited by 17

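One-Sentence Summary

This paper proposes DF-ERC, a four-tier framework that jointly disentangles multimodal and contextual features and then fuses them through contribution-aware and context-refusion mechanisms, achieving state-of-the-art MM-ERC performance on MELD and IEMOCAP.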

ABSTRACT

It has been a hot research topic to enable machines to understand human emotions in multimodal contexts under dialogue scenarios, which is tasked with multimodal emotion analysis in conversation (MM-ERC). MM-ERC has received consistent attention in recent years, where a diverse range of methods has been proposed for securing better task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion for maximizing feature utility. Yet after revisiting the characteristic of MM-ERC, we argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps. In this work, we target further pushing the task performance by taking full consideration of the above insights. On the one hand, during feature disentanglement, based on the contrastive learning technique, we devise a Dual-level Disentanglement Mechanism (DDM) to decouple the features into both the modality space and utterance space. On the other hand, during the feature fusion stage, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. They together schedule the proper integrations of multimodal and context features. Specifically, CFM explicitly manages the multimodal feature contributions dynamically, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system achieves new state-of-the-art performance consistently. Further analyses demonstrate that all our proposed mechanisms greatly facilitate the MM-ERC task by making full use of the multimodal and context features adaptively. Note that our proposed methods have the great potential to facilitate a broader range of other conversational multimodal tasks.

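Research Motivation and Goals

Model multimodality and dialogue context simultaneously during the feature disentanglement and fusion stages of MM-ERC to improve performance.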
Develop a dual-level disentanglement mechanism to separate modality and utterance information.

Proposed Method

Encode the entire dialogue with a RoBERTa-based text language model.
Extract audio features with OpenSmile, and extract visual features with a DenseNet pretrained on FER+.
Apply the Dual-level Disentanglement Mechanism (DDM) to perform modality-level and utterance-level contrastive learning, then concatenate the raw and disentangled features.
Use the Contribution-aware Fusion Mechanism (CFM) to dynamically weight modalities based on their true classification probabilities (TCP), i.e. the probability each modality's prediction assigns to the correct emotion label.
Apply Context Refusion Mechanism (CRM) with a prototype-based alignment to decide how much dialogue context to incorporate, using a Bi-LSTM for contextual fusion.
Train with a composite loss: contrastive losses (DDM), TCP-guided fusion loss (CFM), a contextual alignment loss (CRM), a prototype-alignment loss, and standard emotion prediction loss.
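
To make the contribution-aware fusion step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the normalization of TCP scores into weights, and the weighted-sum fusion are all assumptions chosen for illustration. It also reads the gold label directly, which is only possible during training; at inference a learned estimate of the TCP would be needed, which is presumably what the TCP-guided fusion loss supervises.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def true_class_probability(logits, gold_label):
    """Probability that one unimodal classifier assigns to the gold label."""
    return softmax(logits)[gold_label]


def contribution_weights(unimodal_logits, gold_label):
    """Assumed scheme: normalize per-modality TCP scores so they sum to 1."""
    tcp = {
        name: true_class_probability(logits, gold_label)
        for name, logits in unimodal_logits.items()
    }
    total = sum(tcp.values())
    return {name: score / total for name, score in tcp.items()}


def fuse(features, weights):
    """Weighted sum of equally sized per-modality feature vectors."""
    dim = len(next(iter(features.values())))
    return [
        sum(weights[name] * features[name][i] for name in features)
        for i in range(dim)
    ]


# Hypothetical toy inputs: three emotion classes, two-dimensional features.
logits = {
    "text": [2.0, 0.0, 0.0],   # confident and correct for class 0
    "audio": [0.0, 0.0, 0.0],  # uninformative
    "video": [0.0, 1.0, 0.0],  # leans toward the wrong class
}
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [1.0, 1.0]}

weights = contribution_weights(logits, gold_label=0)
fused = fuse(features, weights)
```

In this toy case the text modality receives the largest weight and the video modality the smallest, so the fused vector is dominated by the modality that best supports the correct label.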

Experimental Results

Research Questions

RQ1: How can MM-ERC benefit from disentangling features along both modality and utterance dimensions?
RQ2: Can dynamic, contribution-aware fusion improve multimodal integration over fixed fusion schemes?
RQ3: Does incorporating context fusion via a prototype-alignment-based CRM improve utterance-level emotion prediction?
RQ4: Are the proposed mechanisms effective across common MM-ERC benchmarks (MELD, IEMOCAP)?

Key Findings

DF-ERC achieves state-of-the-art performance on MELD and IEMOCAP across several metrics.
Both modality- and utterance-level disentanglement (DDM) significantly improve results compared with ablated variants.
Contribution-aware fusion (CFM) and context refusion (CRM) provide substantial gains; removing them degrades performance.
CRM’s context-aware weighting outperforms static full or zero-context baselines, demonstrating the value of adaptive context integration.
Text remains a strong modality, but adding audio and video with the proposed fusion strategies yields notable gains over unimodal baselines.
Ablation analyses confirm the importance of each component (DDM, CFM, CRM) and of modality contribution tuning.
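
The comparison between adaptive context weighting and the static full-context or zero-context baselines can be pictured as a gate that interpolates between the two extremes. The sketch below is only an illustration under stated assumptions: the cosine-similarity gate, the rule that an utterance close to an emotion prototype needs less context, and all function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally sized vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_gate(utterance, prototypes):
    """Hypothetical gate in [0, 1]: the farther the utterance is from every
    emotion prototype, the more dialogue context is admitted."""
    best = max(cosine(utterance, p) for p in prototypes)
    return (1.0 - best) / 2.0  # maps similarity in [-1, 1] to a gate in [0, 1]


def refuse(utterance, context, gate):
    """Interpolate between the utterance feature (gate=0, zero-context baseline)
    and the context feature (gate=1, full-context baseline)."""
    return [(1.0 - gate) * u + gate * c for u, c in zip(utterance, context)]


# Hypothetical toy vectors.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
clear_utterance = [1.0, 0.0]        # coincides with a prototype
ambiguous_utterance = [-1.0, -1.0]  # far from both prototypes
context = [0.5, 0.5]

gate_clear = context_gate(clear_utterance, prototypes)
gate_ambiguous = context_gate(ambiguous_utterance, prototypes)
mixed = refuse(ambiguous_utterance, context, gate_ambiguous)
```

Setting the gate to a constant 0 or 1 recovers the two static baselines, while a per-utterance gate lets clear utterances rely on their own features and ambiguous ones draw more on the dialogue history.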


This review was generated by AI and reviewed by human editors.