QUICK REVIEW

[Paper Review] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Xiaojiang Peng, Yutao Chen | arXiv (Cornell University) | Jan 23, 2026

Emotion and Mood Recognition | Cited by 0

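One-Sentence Summary

Emotion-LLaMAv2 is an end-to-end multimodal emotion understanding framework that combines a Conv Attention pre-fusion module with perception-to-cognition curriculum learning, and it is evaluated on the unified MMEVerse benchmark. It reports state-of-the-art results and better generalization than open-source MLLMs.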

ABSTRACT

Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

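Research Motivation and Goals

Advance robust multimodal emotion understanding that combines perception with semantic reasoning across audio, visual, and text signals.
Remove the dependence on external face detectors so that training is end-to-end and captures richer emotional cues.
Unify emotion recognition and emotion reasoning within a language model framework through curriculum instruction tuning.
Provide a large-scale, standardized benchmark (MMEVerse) for reproducible evaluation across datasets and tasks.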

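Proposed Method

Develop an end-to-end multimodal encoder with multiview visual encoding and audio encoding to capture spatial, temporal, and prosodic cues.
Introduce a Conv Attention pre-fusion module that performs local and global cross-modal interaction before the features enter the large language model.
Align the fused multimodal representation to the LLM space through a modality adapter, and fine-tune with LoRA for instruction following and emotion tasks.
Apply perception-to-cognition curriculum learning on a LLaMA2 backbone, moving progressively from basic emotion recognition to context-aware emotion reasoning.
Build MMEVerse by aggregating 12 datasets into a unified instruction-tuning format and re-annotating them with a multi-agent pipeline, yielding 130k training clips and 36k testing clips.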
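
The pre-fusion step in the list above can be pictured with a toy example. The sketch below is only an illustration of the general idea of mixing a local (convolutional) branch with a global (attention) branch over concatenated modality tokens. It is not the authors' module: the function names, the fixed smoothing kernel, the parameter-free attention, and the additive combination are all assumptions made for this example, and a real module would use learned weights.

```python
import math


def local_conv(tokens, kernel=(0.25, 0.5, 0.25)):
    """Local branch: 1D convolution along the token axis with zero padding.

    The fixed smoothing kernel is a stand-in for learned convolution weights.
    """
    half = len(kernel) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(tokens):
                for d in range(dim):
                    row[d] += w * tokens[j][d]
        out.append(row)
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Global branch: scaled dot-product self-attention without projections.

    Every token attends to every other token, regardless of modality.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def conv_attention_prefusion(visual_tokens, audio_tokens):
    """Hypothetical pre-fusion: concatenate tokens, then sum both branches."""
    tokens = visual_tokens + audio_tokens  # concatenate along the token axis
    local = local_conv(tokens)
    global_ = global_attention(tokens)
    return [[l + g for l, g in zip(l_row, g_row)]
            for l_row, g_row in zip(local, global_)]


# Made-up 2-dimensional tokens: three visual tokens and one audio token.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5]]
fused = conv_attention_prefusion(visual, audio)
print(len(fused), len(fused[0]))
```

The fused sequence keeps one output token per input token, so in this toy setup it could be handed to a modality adapter of the kind the next list item describes.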

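Experimental Results

Research Questions

RQ1: How can end-to-end multimodal emotion understanding be achieved without an explicit face detector?
RQ2: Does the Conv Attention pre-fusion module improve the cross-modal interaction needed to perceive emotion?
RQ3: Within a unified LLM framework, does curriculum instruction tuning improve both emotion recognition and emotion reasoning?
RQ4: Is a large-scale, standardized benchmark such as MMEVerse effective for training and evaluating multimodal emotion models across many datasets?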
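
The last two questions concern the curriculum and the unified instruction format, which can be made concrete with a small sketch. Everything in it is hypothetical: the record fields, the prompt wording, the stage names, the clip identifiers, and the labels are assumptions for illustration, not the MMEVerse schema or the authors' training schedule. It only shows the general pattern of converting samples from different datasets into one instruction record type and ordering recognition (perception) samples before reasoning (cognition) samples.

```python
from dataclasses import dataclass

# Hypothetical stage labels for a perception-to-cognition curriculum.
PERCEPTION = "perception"
COGNITION = "cognition"


@dataclass
class InstructionSample:
    source: str       # originating dataset name, e.g. "MELD"
    clip_id: str      # made-up clip identifier
    stage: str        # PERCEPTION or COGNITION
    instruction: str  # prompt given to the model
    response: str     # target answer


def to_recognition_sample(source, clip_id, label):
    """Wrap a categorical emotion label as a perception-stage sample."""
    return InstructionSample(
        source, clip_id, PERCEPTION,
        "What emotion does the person in this clip express?",
        label,
    )


def to_reasoning_sample(source, clip_id, description):
    """Wrap a free-form description as a cognition-stage sample."""
    return InstructionSample(
        source, clip_id, COGNITION,
        "Explain the emotion in this clip using visual, audio, and text cues.",
        description,
    )


def curriculum_order(samples):
    """Order perception samples first; sorted() is stable, so the original
    order is preserved within each stage."""
    stage_rank = {PERCEPTION: 0, COGNITION: 1}
    return sorted(samples, key=lambda s: stage_rank[s.stage])


# Placeholder samples; the labels and description are invented.
samples = [
    to_reasoning_sample("MELD", "clip-002", "Placeholder explanation."),
    to_recognition_sample("DFEW", "clip-001", "happy"),
    to_recognition_sample("MAFW", "clip-003", "sad"),
]
ordered = curriculum_order(samples)
for s in ordered:
    print(s.stage, s.source, s.clip_id)
```

A strict two-phase ordering is the simplest possible reading of a curriculum; a real schedule might instead mix the stages gradually.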

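Key Findings

Emotion-LLaMAv2 outperforms representative open-source MLLMs on MER-UniBench and MMEVerse-Bench.
The model shows better generalization and more structured multimodal reasoning behavior.
MMEVerse provides a unified, scalable resource with 130k training clips and 36k testing clips, covering 18 benchmarks.
Emotion-LLaMAv2 achieves competitive or better results compared with Qwen2.5 Omni, HumanOmni, and AffectGPT.
Ablation studies show gains from end-to-end encoding, Conv Attention fusion, and the perception-to-cognition curriculum.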


This review was generated by AI and checked by a human editor.