QUICK REVIEW

[Paper Review] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Xiaojiang Peng, Yutao Chen|arXiv (Cornell University)|2026. 01. 23.

Emotion and Mood Recognition | Citations: 0

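ONE-LINE SUMMARY

Emotion-LLaMAv2 is an end-to-end multimodal emotion understanding framework built on a Conv Attention pre-fusion module and a perception-to-cognition curriculum, and it is evaluated on the unified MMEVerse benchmark. It achieves state-of-the-art performance and generalizes better than open-source MLLMs.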

ABSTRACT

Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

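MOTIVATION AND GOALS

Promote robust multimodal emotion understanding that combines perception with semantic reasoning across audio, visual, and text signals.
Remove the dependence on external face detectors, enabling end-to-end training and richer emotional cues.
Unify emotion recognition and emotion reasoning within a language-model framework through curriculum instruction tuning.
Provide a large-scale standardized benchmark (MMEVerse) for reproducible evaluation across diverse datasets and tasks.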

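PROPOSED METHOD

Develop an end-to-end multimodal encoder, comprising a multiview visual encoder and an audio encoder, to capture spatial, temporal, and prosodic cues.
Introduce a Conv Attention pre-fusion module that enables simultaneous local and global cross-modal interactions before the LLM input.
Use modality adapters to align the fused multimodal representation with the LLM embedding space, and adapt instruction following to emotion tasks through LoRA tuning.
Apply a perception-to-cognition curriculum within the LLaMA2 backbone that progresses in stages from basic emotion recognition to context-aware emotion reasoning.
Build MMEVerse by aggregating twelve datasets into a single consistent instruction-tuning format and re-annotating them with a multi-agent pipeline, yielding 130k training clips and 36k testing clips.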
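The review does not spell out the internals of the Conv Attention pre-fusion module, so the following is only a minimal sketch of the general idea: a convolution over neighbouring tokens supplies local interaction, self-attention over the concatenated visual and audio tokens supplies global interaction, and both run before anything reaches the LLM. The function names, the fixed smoothing kernel, the identity attention projections, and the residual sum of the two branches are all assumptions made for illustration, not the authors' design.

```python
import math


def conv1d_same(tokens, kernel):
    """Local branch: depthwise 1D convolution along the token axis, zero-padded."""
    n, d = len(tokens), len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for c in range(d):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Global branch: single-head scaled dot-product attention.

    Identity query/key/value projections are used here (an assumption);
    a trained module would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(d)])
    return out


def conv_attention_prefusion(visual, audio, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical pre-fusion: residual + local conv branch + global attention branch."""
    if len(visual[0]) != len(audio[0]):
        raise ValueError("visual and audio tokens must share one feature dimension")
    tokens = visual + audio  # concatenate along the token axis
    local = conv1d_same(tokens, kernel)
    global_ctx = self_attention(tokens)
    return [
        [t + l + g for t, l, g in zip(t_row, l_row, g_row)]
        for t_row, l_row, g_row in zip(tokens, local, global_ctx)
    ]


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]  # two toy visual tokens
    audio = [[0.5, 0.5]]               # one toy audio token
    fused = conv_attention_prefusion(visual, audio)
    print(len(fused), len(fused[0]))   # token count and feature dimension
```

The fused sequence keeps one row per input token, so in a full model it could be passed to a modality adapter that maps it into the LLM embedding space. A real implementation would more likely use a tensor library with learned convolution and attention weights.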

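EXPERIMENTAL RESULTS

Research Questions

RQ1: Can end-to-end multimodal emotion understanding be achieved without an explicit face detector?
RQ2: Can the Conv Attention pre-fusion module improve cross-modal interaction for emotion recognition?
RQ3: Does curriculum-based instruction tuning improve both emotion recognition and reasoning within a unified LLM framework?
RQ4: Is a large-scale standardized benchmark such as MMEVerse effective for training and evaluating multimodal emotion models across diverse datasets?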

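Key Results

Emotion-LLaMAv2 outperforms representative open-source MLLMs on MER-UniBench and MMEVerse-Bench.
The model generalizes better and exhibits more structured multimodal reasoning behavior.
MMEVerse provides a unified, scalable resource with 130k training clips and 36k testing clips across 18 benchmarks.
Emotion-LLaMAv2 achieves competitive or superior results compared with Qwen2.5 Omni, HumanOmni, and AffectGPT.
Ablation studies demonstrate the benefits of end-to-end encoding, Conv Attention fusion, and the perception-to-cognition curriculum.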

This review was generated by AI and reviewed by a human editor.