QUICK REVIEW

[Paper Digest] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni|arXiv (Cornell University)|Nov 27, 2023

Topic Modeling | Cited by 15

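ONE-SENTENCE SUMMARY

MMMU is a college-level multimodal benchmark spanning six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), with 11.5K questions covering 30 subjects and 183 subfields, designed to test expert-level perception, knowledge, and reasoning in multimodal models. It reveals a large gap between open-source large multimodal models (LMMs) and GPT-4V(ision), and GPT-4V itself still leaves substantial room for improvement.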

ABSTRACT

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

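MOTIVATION AND OBJECTIVES

Evaluate expert-level multimodal understanding and reasoning on college-level subjects.
Assess how current LMMs handle diverse image formats and interleaved text-image inputs.
Examine the gap between open-source models and proprietary leaders on expert-domain tasks.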

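PROPOSED METHOD

Manually curate 11.5K multimodal questions from college exams, quizzes, and textbooks, covering 30 subjects and 183 subfields.
Include 30 heterogeneous image types (charts, diagrams, maps, tables, music sheets, chemical structures, and others) along with interleaved text-image inputs.
Evaluate models in a zero-shot setting, use robust answer extraction to handle both open-ended and multiple-choice formats, and report micro-averaged accuracy.
Provide baseline comparisons across 14 open-source LMMs and the proprietary GPT-4V(ision) and Gemini Ultra.
Analyze errors and categorize failure modes into perception, knowledge, and reasoning.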
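The evaluation step above (answer extraction followed by micro-averaged accuracy) can be sketched as follows. This is a minimal illustration, not the benchmark's official evaluation code: the function names `extract_choice` and `accuracy_report`, the regular-expression patterns, and the toy records are all assumptions made for this example.

```python
import re
from collections import defaultdict


def extract_choice(response, options=("A", "B", "C", "D")):
    """Return the first option letter found in a model response, or None.

    Hypothetical rule-based extractor: the patterns are tried from most
    explicit ("(B)", "Answer: B") to least explicit (a standalone letter).
    The last pattern can misfire on the English article "A".
    """
    letters = "".join(options)
    patterns = [
        rf"\(([{letters}])\)",
        rf"[Aa]nswer\s*(?:is)?\s*:?\s*([{letters}])\b",
        rf"\b([{letters}])\b",
    ]
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            return match.group(1)
    return None


def accuracy_report(records):
    """Compute (micro, macro) accuracy from (subject, is_correct) pairs.

    Micro-averaging pools all questions, so large subjects weigh more.
    Macro-averaging is shown only for contrast: it averages per-subject
    accuracies, so every subject weighs the same.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, is_correct in records:
        per_subject[subject][0] += int(is_correct)
        per_subject[subject][1] += 1
    total_correct = sum(correct for correct, _ in per_subject.values())
    total = sum(count for _, count in per_subject.values())
    micro = total_correct / total
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    return micro, macro


# Toy data (invented for illustration): 3 Art questions, 1 Science question.
toy_records = [("Art", True), ("Art", True), ("Art", True), ("Science", False)]
micro, macro = accuracy_report(toy_records)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

With the toy records, three of four questions are correct, so the micro average is 0.75, while the macro average is 0.5 because the single Science question counts as much as the three Art questions combined. Reporting the micro average means the headline number reflects the per-question success rate over the whole benchmark.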

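EXPERIMENTAL RESULTS

Research Questions

RQ1: How well do current multimodal models perceive and process diverse image types in expert-domain tasks?
RQ2: To what extent can models apply college-level domain knowledge to solve problems with interleaved text and images?
RQ3: What performance gaps exist between open-source LMMs and proprietary leaders across MMMU's multi-discipline tasks?
RQ4: What are the main categories of perception, knowledge, and reasoning errors in expert-level multimodal tasks?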

Key Findings

Model	Easy	Medium	Hard	Overall
Fuyu-8B	27.4	27.0	26.4	27.4
Qwen-VL-7B	32.9	31.9	27.6	32.9
LLaVA-1.5-13B	33.6	32.7	26.7	33.6
InstructBLIP-T5-XXL	33.8	32.3	29.4	33.8
BLIP-2 FLAN-T5-XXL	34.0	32.7	28.5	34.0
GPT-4V	76.1	55.6	31.2	55.7

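GPT-4V(ision) reaches an overall accuracy of 55.7%, which indicates substantial room for improvement on MMMU.
The top open-source models (e.g., BLIP-2 FLAN-T5-XXL, LLaVA-1.5) reach roughly 34% overall accuracy, a large gap relative to GPT-4V.
Adding OCR or captions yields only limited gains on MMMU, suggesting that deeper joint interpretation of images and text is required.
Disciplines with relatively simple visual data (Art & Design, Humanities & Social Science) see higher scores than those requiring complex visuals and domain-specific reasoning (Science, Health & Medicine, Tech & Engineering).
An error analysis of 150 GPT-4V cases attributes 35% to perceptual errors, 29% to missing knowledge, and 26% to flawed reasoning, underscoring the multifaceted challenge MMMU poses.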
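The gap between GPT-4V and the open-source models can be recomputed from the Hard and Overall columns of the table above. The sketch below is only a convenience calculation over those reported numbers; the helper name `gap_to_reference` is an assumption made for this example.

```python
# Hard and Overall accuracies (%) copied from the table above.
scores = {
    "Fuyu-8B": {"hard": 26.4, "overall": 27.4},
    "Qwen-VL-7B": {"hard": 27.6, "overall": 32.9},
    "LLaVA-1.5-13B": {"hard": 26.7, "overall": 33.6},
    "InstructBLIP-T5-XXL": {"hard": 29.4, "overall": 33.8},
    "BLIP-2 FLAN-T5-XXL": {"hard": 28.5, "overall": 34.0},
    "GPT-4V": {"hard": 31.2, "overall": 55.7},
}


def gap_to_reference(table, reference="GPT-4V"):
    """Return, per model, how many points it trails the reference model."""
    ref = table[reference]
    return {
        model: {split: round(ref[split] - value, 1) for split, value in row.items()}
        for model, row in table.items()
        if model != reference
    }


gaps = gap_to_reference(scores)
for model, gap in gaps.items():
    print(f"{model}: overall gap {gap['overall']} pts, hard gap {gap['hard']} pts")
```

For the strongest open-source entry in the table, BLIP-2 FLAN-T5-XXL, this gives an overall gap of 21.7 points but a gap of only 2.7 points on the Hard split, so by these numbers most of GPT-4V's lead comes from the easier questions.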


This digest was generated by AI and reviewed by human editors.