QUICK REVIEW

[Paper Digest] Explainable Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun|arXiv (Cornell University)|Jun 27, 2023

Sentiment Analysis and Opinion Mining | Cited by 8

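One-Sentence Summary

This paper proposes Explainable Multimodal Emotion Recognition (EMER), a task that pairs emotion predictions with explanations, and contributes a new dataset, baseline multimodal large language models, evaluation metrics, and AffectGPT, a multimodal large language model for affective computing.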

ABSTRACT

Multimodal emotion recognition is an important research topic in artificial intelligence, whose main goal is to integrate multimodal clues to identify human emotional states. Current works generally assume accurate labels for benchmark datasets and focus on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to some basic categories, then hire plenty of annotators and use majority voting to select the most likely label. However, this process may result in some correct but non-candidate or non-majority labels being ignored. To ensure reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Recognition (EMER)". Unlike traditional emotion recognition, EMER takes a step further by providing explanations for these predictions. Through this task, we can extract relatively reliable labels since each label has a certain basis. Meanwhile, we borrow large language models (LLMs) to disambiguate unimodal clues and generate more complete multimodal explanations. From them, we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including introducing a new dataset, establishing baselines, and defining evaluation metrics. In addition, EMER can serve as a benchmark task to evaluate the audio-video-text understanding performance of multimodal LLMs.
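
The majority-voting step described in the abstract can be illustrated with a small sketch. The annotator votes and the `majority_label` helper are hypothetical, chosen only to show how a plausible minority label drops out of the final annotation.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label and the set of labels it displaces."""
    counts = Counter(votes)
    # most_common(1) gives the single (label, count) pair with the highest count.
    winner, _ = counts.most_common(1)[0]
    discarded = set(counts) - {winner}
    return winner, discarded


# Hypothetical votes from five annotators for one video clip.
votes = ["happy", "happy", "happy", "surprised", "surprised"]
winner, discarded = majority_label(votes)
print(winner)     # happy
print(discarded)  # {'surprised'}
```

Here two of five annotators saw surprise, yet only "happy" survives as the label, which is the kind of loss EMER is meant to avoid.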

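Motivation and Objectives

Address the label ambiguity in multimodal emotion recognition that arises from the subjective nature of emotion.
Introduce EMER to provide explanations for predicted emotions rather than predictions alone.
Create an initial EMER dataset and baseline models with evaluation metrics.
Propose AffectGPT, a multimodal large language model for EMER.
Lay the groundwork for evaluating audio-video-text understanding in multimodal large language models.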

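Proposed Method

Define EMER as a task that requires sound reasoning behind each emotion prediction.
Build an initial EMER dataset from MER2023, annotated with clues and emotions.
Develop baselines from multimodal large language models that can process video input (VideoChat, Video-LLaMA, PandaGPT, Valley).
Add subtitles and audio to the prompts to assess multimodal reasoning ability.
Evaluate clue overlap, emotion overlap, and reasoning completeness with both automatic (ChatGPT-based) and human evaluation.
Introduce AffectGPT, a multimodal large language model trained on EMER data to strengthen emotion reasoning.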
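
The emotion overlap evaluation listed above can be approximated with a set-based score. This is a minimal sketch under stated assumptions: the digest says overlap is judged with ChatGPT and by humans, whereas the `label_overlap` function, its precision and recall definitions, and the example labels are hypothetical stand-ins, not the paper's metric.

```python
def label_overlap(predicted, reference):
    """Compare two open-vocabulary label lists by exact, case-insensitive match.

    Returns (precision, recall): the share of predicted labels found in the
    reference, and the share of reference labels that were predicted.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    hits = pred & ref
    precision = len(hits) / len(pred) if pred else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical labels extracted from a model explanation and from the annotation.
predicted = ["Happy", "surprised"]
reference = ["happy", "excited", "surprised", "grateful"]
print(label_overlap(predicted, reference))  # (1.0, 0.5)
```

Exact string matching misses synonyms (for example "joyful" versus "happy"), which is one reason an LLM-based judge may suit open-vocabulary labels better than this simple score.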

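Experimental Results

Research Questions

RQ1: Can EMER improve the quality and reliability of emotion label annotation through explainable reasoning?
RQ2: How well do current multimodal large language models perform explainable emotion reasoning across visual, audio, and text modalities?
RQ3: Does instruction tuning on EMER data improve emotion reasoning and multimodal understanding?
RQ4: What added value does a multimodal emotion model designed specifically for the EMER task (AffectGPT) provide?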

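Key Findings

Current multimodal large language models struggle with emotion reasoning and show a large gap from the ground truth in both clue overlap and label overlap.
AffectGPT achieves the highest scores among the baselines in clue overlap, label overlap, and human evaluation.
Ensembling multiple baselines improves emotion reasoning performance compared with a single model.
Longer videos generally yield richer emotion-related descriptions and higher modality completeness.
Video-centric baselines tend to overlook audio clues, highlighting the need for richer audio instruction datasets.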
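
One simple way to picture the ensembling finding is to keep any emotion label named by at least a minimum number of models. The digest does not describe how the ensemble was built, so the voting rule, the `ensemble_labels` function, the `min_votes` parameter, and the model outputs below are all assumptions made for illustration.

```python
from collections import Counter


def ensemble_labels(model_outputs, min_votes=2):
    """Keep labels that at least `min_votes` models produced (hypothetical rule)."""
    counts = Counter()
    for labels in model_outputs:
        # Normalize and de-duplicate so each model votes once per label.
        counts.update({label.strip().lower() for label in labels})
    return sorted(label for label, n in counts.items() if n >= min_votes)


# Hypothetical label lists extracted from four baseline models for one clip.
outputs = [
    ["happy", "surprised"],
    ["happy"],
    ["surprised", "angry"],
    ["Happy", "excited"],
]
print(ensemble_labels(outputs))  # ['happy', 'surprised']
```

Lowering `min_votes` to 1 turns this into a plain union, which favors recall over precision.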


This digest was generated by AI and reviewed by human editors.