QUICK REVIEW

[Paper Review] M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Wenxuan Zhang, Sharifah Mahani Aljunied|arXiv (Cornell University)|2023. 06. 08.

Topic Modeling | Citations: 31

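One-Line Summary

The paper introduces M3Exam, a multilingual, multimodal, multilevel benchmark built from real exams, with 12,317 questions across 9 languages, and uses it to evaluate LLMs; GPT-4 leads, but multilingual and multimodal performance remains limited.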

ABSTRACT

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code is available at https://github.com/DAMO-NLP-SG/M3Exam.

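Research Motivation and Goals

Argue for human-exam-based evaluation as a way to capture broad intelligence skills that go beyond task-specific benchmarks.
Design a multilingual, multimodal, multilevel benchmark drawn from official exams to reflect real-world cognitive demands.
Provide a dataset with rich contextual information, image-augmented items, and standardized metadata for robust LLM evaluation.
Evaluate several multilingual and multimodal LLMs to identify current strengths and gaps in language, reasoning, and cross-modal understanding.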

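Proposed Method

Collect official exam papers in 9 languages at 3 educational levels (primary, middle, and high school).
Apply OCR and language-specific annotation to produce a consistent text-based multiple-choice format, including cases that require contextual background.
Mark image-bearing questions with placeholders and preserve the corresponding image data for multimodal evaluation.
Evaluate models in a zero-shot (and partly few-shot) setting, using language-specific prompts and constrained decoding to obtain MCQ answers.
Cover both text-only and multimodal evaluation, with models such as GPT-4, ChatGPT, Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, and OpenFlamingo.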
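
The prompting and answer-selection steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names (`build_prompt`, `option_labels`, `pick_answer`), the prompt wording, and the `(A) text` option format are all hypothetical. Constrained decoding is approximated here by restricting the choice to the allowed option labels and taking the best-scoring one, given per-label scores that some model is assumed to supply.

```python
from typing import Dict, List, Optional


def build_prompt(question_text: str, options: List[str],
                 background: Optional[str] = None,
                 instruction: str = "Answer with the option label only.") -> str:
    """Assemble a zero-shot multiple-choice prompt (hypothetical wording)."""
    parts = [instruction]
    if background:
        # Some exam questions depend on a shared passage or context.
        parts.append(f"Background: {background}")
    parts.append(f"Question: {question_text}")
    parts.extend(options)
    parts.append("Answer:")
    return "\n".join(parts)


def option_labels(options: List[str]) -> List[str]:
    """Extract labels from options assumed to look like '(A) text'."""
    return [opt.split(")")[0].lstrip("(").strip() for opt in options]


def pick_answer(label_scores: Dict[str, float], allowed: List[str]) -> str:
    """Keep only allowed labels and return the one with the highest score."""
    candidates = {k: v for k, v in label_scores.items() if k in allowed}
    if not candidates:
        raise ValueError("no allowed label received a score")
    return max(candidates, key=candidates.get)


# Toy usage: the scores are made-up stand-ins for model log-probabilities.
options = ["(A) 3", "(B) 4", "(C) 5", "(D) 6"]
prompt = build_prompt("What is 2 + 2?", options)
scores = {"A": -2.3, "B": -0.4, "C": -3.1, "E": -0.1}
answer = pick_answer(scores, option_labels(options))  # "E" is ignored
```

Restricting the candidates to valid labels means a model cannot be scored on an out-of-range answer such as `E`, which is one simple way to keep multiple-choice accuracy comparable across models.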

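Experimental Results

Research Questions

RQ1: How well do multilingual LLMs perform on real exam questions across many languages and scripts, particularly in low-resource languages?
RQ2: To what extent do multimodal questions with images expose gaps in current multimodal LLMs?
RQ3: Does model performance decrease monotonically with educational level, as it does for humans, or does it follow a different trend?
RQ4: What effect do prompting strategies (monolingual vs. English instruction vs. English translation) and few-shot demonstrations have on multilingual exam questions?
RQ5: How do multilingual LLMs compare with monolingual baselines in accuracy and cross-lingual transfer?
RQ6: What are the limitations of current benchmarks in capturing complex reasoning, cross-modal understanding, and cultural knowledge?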

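Key Findings

GPT-4 performs best in every language but still struggles with low-resource languages and non-Latin scripts.
Most models score below 60% accuracy on the multilingual questions, with notable drops for non-Latin-script and low-resource languages.
Multimodal models perform poorly on complex multimodal questions, and some single-image models (such as BLIP-2) offer only marginal gains over text-only baselines.
The non-monotonic performance trend across educational levels suggests that LLM intelligence develops differently from the human learning path.
English prompting does not consistently improve results, while translating questions into English can substantially raise performance for some languages.
Few-shot demonstrations do not improve performance universally and help only in some languages.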
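
Findings like these rest on accuracy broken down by language and by educational level. A minimal sketch of that aggregation is shown below; the record fields (`language`, `level`, `prediction`, `answer`) and both helper names are assumptions made for illustration, and the toy records are invented, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by_group(records: Iterable[dict], key: str) -> Dict[str, float]:
    """Return accuracy per distinct value of `key` (e.g. 'language' or 'level')."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}


def is_monotonic_decreasing(values: List[float]) -> bool:
    """True when each value is no larger than the one before it."""
    return all(b <= a for a, b in zip(values, values[1:]))


# Invented toy records, only to show the shape of the computation.
records = [
    {"language": "en", "level": "low", "prediction": "A", "answer": "A"},
    {"language": "en", "level": "mid", "prediction": "B", "answer": "C"},
    {"language": "th", "level": "high", "prediction": "D", "answer": "D"},
    {"language": "th", "level": "low", "prediction": "A", "answer": "B"},
]
per_language = accuracy_by_group(records, "language")
per_level = accuracy_by_group(records, "level")
trend = [per_level[lv] for lv in ("low", "mid", "high")]
decreasing = is_monotonic_decreasing(trend)
```

Checking the ordered per-level accuracies with `is_monotonic_decreasing` is one simple way to test whether a model follows the human-like pattern of scoring lower as the level rises.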


This review was generated by AI and reviewed by a human editor.