QUICK REVIEW

[Paper Review] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Xiaojiang Peng, Yutao Chen|arXiv (Cornell University)|Jan 23, 2026
Emotion and Mood Recognition|Cited by 0
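One-Sentence Summary

Emotion-LLaMAv2 is an end-to-end multimodal emotion understanding framework that combines a Conv Attention pre-fusion module with perception-to-cognition curriculum learning, evaluated on the unified MMEVerse benchmark. It reports state-of-the-art results and better generalization than open-source MLLMs.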

ABSTRACT

Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

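Motivation and Objectives

  • Advance robust multimodal emotion understanding that combines perception with semantic reasoning across audio, visual, and textual signals.
  • Remove the dependence on external face detectors, enabling end-to-end training and richer emotional cues.
  • Unify emotion recognition and emotion reasoning within a language-model framework through curriculum instruction tuning.
  • Provide a large-scale, standardized benchmark (MMEVerse) for reproducible evaluation across datasets and tasks.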

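Proposed Method

  • Build an end-to-end multimodal encoder with multiview visual encoding and audio encoding to capture spatial, temporal, and prosodic cues.
  • Introduce a Conv Attention pre-fusion module that performs local and global cross-modal interaction before the features enter the LLM.
  • Align the fused multimodal representation to the LLM space through a modality adapter, with LoRA fine-tuning for instruction following and emotion tasks.
  • Apply perception-to-cognition curriculum learning on the LLaMA2 backbone, moving gradually from basic emotion recognition to context-aware emotion reasoning.
  • Build MMEVerse by aggregating 12 datasets into a unified instruction-tuning format and re-annotating them with a multi-agent pipeline, yielding 130k training clips and 36k test clips.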
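
The last two points, a unified instruction format and a perception-to-cognition curriculum, can be illustrated with a minimal sketch. Nothing here comes from the paper's code: the `InstructionSample` fields, the prompt wording, the `to_samples` and `curriculum` helpers, and the two toy records are all hypothetical, and the strict two-stage ordering is only one plausible reading of "perception to cognition".

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical stage names; the summary only says that training moves
# from perception (recognition) to cognition (reasoning).
PERCEPTION = "perception"
COGNITION = "cognition"


@dataclass(frozen=True)
class InstructionSample:
    """One clip in an assumed unified instruction format."""
    clip_id: str
    source: str       # originating dataset, e.g. "MELD"
    instruction: str  # prompt given to the model
    response: str     # target text
    stage: str        # PERCEPTION or COGNITION


def to_samples(record: Dict[str, str], source: str) -> List[InstructionSample]:
    """Turn one raw record into a recognition sample and, if a free-form
    description is present, an additional reasoning sample."""
    samples = [
        InstructionSample(
            clip_id=record["clip_id"],
            source=source,
            instruction="What emotion does the person in the clip express?",
            response=record["label"],
            stage=PERCEPTION,
        )
    ]
    description = record.get("description")
    if description:  # assumed: only re-annotated clips carry a description
        samples.append(
            InstructionSample(
                clip_id=record["clip_id"],
                source=source,
                instruction="Explain which cues in the clip reveal the emotion.",
                response=description,
                stage=COGNITION,
            )
        )
    return samples


def curriculum(samples: List[InstructionSample]) -> Iterator[InstructionSample]:
    """Yield every perception sample first, then every cognition sample."""
    for stage in (PERCEPTION, COGNITION):
        for sample in samples:
            if sample.stage == stage:
                yield sample


# Toy records for illustration only; they are not taken from any dataset.
raw = {
    "MELD": [{"clip_id": "m1", "label": "anger",
              "description": "Raised voice and a frown."}],
    "DFEW": [{"clip_id": "d1", "label": "happiness"}],
}
pool = [s for source, records in raw.items()
        for r in records for s in to_samples(r, source)]
ordered = list(curriculum(pool))
print([(s.clip_id, s.stage) for s in ordered])
# → [('m1', 'perception'), ('d1', 'perception'), ('m1', 'cognition')]
```

Keeping the stage as a field on each sample, rather than in separate files, means the same pool can be reordered or mixed if a softer curriculum schedule is preferred.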

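Experimental Results

Research Questions

  • RQ1: How can end-to-end multimodal emotion understanding be achieved without an explicit face detector?
  • RQ2: Does the Conv Attention pre-fusion module improve cross-modal interaction for emotion perception?
  • RQ3: Does curriculum instruction tuning within a unified LLM framework improve both emotion recognition and reasoning?
  • RQ4: Is a large-scale, standardized benchmark such as MMEVerse effective for training and evaluating multimodal emotion models across multiple datasets?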

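Key Findings

  • Emotion-LLaMAv2 outperforms representative open-source MLLMs on MER-UniBench and MMEVerse-Bench.
  • The model shows better generalization and more structured multimodal reasoning behavior.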
  • MMEVerse provides a unified, scalable resource with 130k training clips and 36k test clips covering 18 benchmarks.
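  • Emotion-LLaMAv2 achieves competitive or better results than Qwen2.5 Omni, HumanOmni, and AffectGPT.
  • Ablation studies show gains from end-to-end encoding, Conv Attention fusion, and the perception-to-cognition curriculum.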


This review was generated by AI and checked by a human editor.