QUICK REVIEW

[Paper Review] M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Wenxuan Zhang, Sharifah Mahani Aljunied | arXiv (Cornell University) | Jun 8, 2023

Topic Modeling | Cited by 31

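One-Sentence Summary

Introduces M3Exam, a benchmark built from real exam questions that supports multilingual, multimodal, and multilevel evaluation of LLMs, covering 12,317 questions in 9 languages; GPT-4 leads, but multilingual and multimodal performance remains limited.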

ABSTRACT

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code is available at https://github.com/DAMO-NLP-SG/M3Exam.

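Research Motivation and Goals

Motivation: evaluation grounded in human exams is needed to capture broad intelligence skills that go beyond task-specific benchmarks.
Design a multilingual, multimodal, multilevel benchmark sourced from official exams to reflect real-world cognitive demands.
Provide a dataset with rich contextual information, image-based questions, and standardized metadata to enable robust evaluation of LLMs.
Evaluate a range of multilingual and multimodal LLMs to identify current strengths and weaknesses in language, reasoning, and cross-modal understanding.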

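Proposed Method

Collect official exam papers in 9 languages across 3 educational levels (primary school, middle school, high school).
Apply optical character recognition (OCR) and language-specific annotation where needed to produce a unified text-based multiple-choice question format with accompanying background context.
Mark questions containing images with placeholders and retain the corresponding image data for multimodal evaluation.
Evaluate models with language-specific prompts in zero-shot (and some few-shot) settings, applying constrained decoding to multiple-choice question (MCQ) answers.
Evaluate both text-only and multimodal models, including GPT-4, ChatGPT, Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, and OpenFlamingo.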
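The steps above (a unified multiple-choice format, image placeholders, zero-shot prompts, and answers constrained to the option labels) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the `Question` fields, the `(image)[...]` placeholder syntax, the instruction string, and the per-option scores are all hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder syntax. The summary only says image questions are
# marked with placeholders; it does not say what the marker looks like.
IMAGE_PLACEHOLDER = re.compile(r"\(image\)\[[^\]]+\]")


@dataclass
class Question:
    """Illustrative record for one multiple-choice question (assumed fields)."""
    language: str
    level: str          # e.g. "low", "mid", "high" (illustrative labels)
    text: str
    options: list[str]
    answer: str         # gold option label, e.g. "2"
    background: str = ""


def needs_image(q: Question) -> bool:
    """True if the question, its background, or any option has an image placeholder."""
    fields = [q.text, q.background, *q.options]
    return any(IMAGE_PLACEHOLDER.search(s) for s in fields)


def build_prompt(q: Question, instruction: str) -> str:
    """Assemble a zero-shot prompt: instruction, optional background, question, options."""
    parts = [instruction]
    if q.background:
        parts.append(q.background)
    parts.append(q.text)
    parts.extend(q.options)
    parts.append("Answer:")
    return "\n".join(parts)


def constrained_choice(option_scores: dict[str, float]) -> str:
    """Return the option label with the highest score.

    Restricting the prediction to the option labels is one simple way to
    realize constrained answer selection; the scores are assumed to come
    from a model (for example, log-probabilities of each label).
    """
    return max(option_scores, key=option_scores.get)
```

In use, a text-only evaluation would skip or separately report questions for which `needs_image` is true, build one prompt per remaining question with an instruction written in that question's language, and compare `constrained_choice(...)` with `q.answer`.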

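Experimental Results

Research Questions

RQ1: How do multilingual LLMs perform on real exam questions across languages and writing systems, particularly for low-resource languages?
RQ2: How large a gap do image-based multimodal questions reveal in multimodal LLMs?
RQ3: Does model performance decline monotonically with educational level, as it does for humans, or does it follow a different trend?
RQ4: How do prompting strategies (monolingual, English-instruction, English-translation) and few-shot demonstrations affect performance on multilingual exam questions?
RQ5: How do multilingual LLMs compare with monolingual baselines in accuracy and cross-lingual transfer?
RQ6: What are the limitations of current benchmarks in capturing complex reasoning, cross-modal understanding, and cultural knowledge?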

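Key Findings

GPT-4 performs best across languages but still struggles with low-resource languages and non-Latin scripts.
Most models score below 60% accuracy on multilingual questions, with marked drops on non-Latin-script and low-resource languages.
Multimodal models underperform on complex multimodal questions; some single-image models (such as BLIP-2) improve only marginally over text-only baselines.
Performance across educational levels is non-monotonic, suggesting that the development of LLM intelligence differs from the human learning trajectory.
English prompting strategies do not consistently improve results; translating questions into English can substantially improve performance for some languages.
Few-shot demonstrations do not improve performance universally; they help only in some languages.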
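Findings such as per-language drops and the non-monotonic trend across educational levels come down to grouping per-question correctness and comparing the group accuracies. A minimal sketch of that bookkeeping is shown below; the record layout, the level labels, and the toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: accuracy} for records shaped like
    {"language": ..., "level": ..., "correct": bool} (assumed layout)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def is_monotonic_decreasing(values):
    """True if each value is no larger than the one before it."""
    return all(b <= a for a, b in zip(values, values[1:]))


# Toy records, invented for illustration only; not data from M3Exam.
records = [
    {"language": "en", "level": "low", "correct": True},
    {"language": "en", "level": "low", "correct": False},
    {"language": "en", "level": "mid", "correct": True},
    {"language": "en", "level": "mid", "correct": True},
    {"language": "en", "level": "high", "correct": True},
    {"language": "en", "level": "high", "correct": False},
]

by_level = accuracy_by(records, "level")
ordered = [by_level[lvl] for lvl in ("low", "mid", "high")]
print(ordered, is_monotonic_decreasing(ordered))
```

Calling `accuracy_by(records, "language")` gives the per-language breakdown in the same way.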


This review was generated by AI and reviewed by a human editor.