QUICK REVIEW

[Paper Review] M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Wenxuan Zhang, Sharifah Mahani Aljunied|arXiv (Cornell University)|Jun 8, 2023

Topic Modeling|Cited by 31

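One-Line Summary

Introduces M3Exam, a multilingual, multimodal, multilevel benchmark built from real exams for evaluating LLMs. It contains 12,317 questions across 9 languages. GPT-4 leads, but multilingual and multimodal performance remains limited.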

ABSTRACT

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code is available at https://github.com/DAMO-NLP-SG/M3Exam.

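Motivation and Objectives

Motivate the need for evaluation based on human exams, which capture broad intellectual skills beyond what task-specific benchmarks measure.
Design a multilingual, multimodal, multilevel benchmark drawn from official exams that reflects real-world cognitive demands.
Provide a dataset with rich contextual information, image-based items, and standardized metadata to enable robust LLM evaluation.
Evaluate a wide range of multilingual and multimodal LLMs to identify current strengths and gaps in language, reasoning, and cross-modal understanding.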

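Proposed Method

Collect official exam questions in 9 languages at three educational levels (elementary, middle, and high school).
Apply OCR and language-specific annotation to produce a unified text-based multiple-choice format, adding contextual background where needed.
Mark questions that contain images with placeholders and retain the corresponding image data for multimodal evaluation.
Evaluate models in zero-shot (and some few-shot) settings with language-specific prompts, applying constrained decoding to multiple-choice question (MCQ) answers.
Cover both text-only and multimodal evaluation, using models such as GPT-4, ChatGPT, Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, and OpenFlamingo.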
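
The prompting and answer-selection steps above can be illustrated with a minimal Python sketch. It does not reproduce the paper's prompt templates or decoding code: the function names `build_prompt` and `constrained_choice`, the prompt layout, and the idea of passing per-token scores as a plain dictionary are all assumptions made for illustration. The sketch shows one common way to constrain a multiple-choice answer, namely by comparing scores only over the valid option letters.

```python
from typing import Dict, List


def build_prompt(instruction: str, question: str, options: List[str],
                 background: str = "") -> str:
    """Assemble a zero-shot multiple-choice prompt (hypothetical layout)."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [instruction]
    if background:
        # Optional context passage that some exam questions depend on.
        lines.append(background)
    lines.append(question)
    lines.extend(f"{letter}. {text}" for letter, text in zip(letters, options))
    lines.append("Answer:")
    return "\n".join(lines)


def constrained_choice(token_scores: Dict[str, float], num_options: int) -> str:
    """Return the best-scoring option letter, ignoring every other token."""
    letters = [chr(ord("A") + i) for i in range(num_options)]
    # A letter with no score is treated as -inf so it can never win.
    return max(letters, key=lambda letter: token_scores.get(letter, float("-inf")))
```

In this sketch the instruction string would be written in the language of the exam question, and `token_scores` stands in for whatever next-token scores a given model exposes; API-only models that return no scores would need a different approach, such as parsing the generated text.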

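Experimental Results

Research Questions

RQ1: How well do multilingual LLMs perform on real-world exam questions across languages and scripts, particularly in low-resource languages?
RQ2: To what extent do multimodal questions containing images expose gaps in current multimodal LLMs?
RQ3: Does LLM performance decrease monotonically with educational level, as it does for humans, or does it follow a different trend?
RQ4: How do prompting strategies (monolingual, English instruction, English translation) and few-shot demonstrations affect performance on multilingual exam questions?
RQ5: How do multilingual LLMs compare with monolingual baselines in accuracy and cross-lingual transfer?
RQ6: What are the limitations of current benchmarks in capturing complex reasoning, cross-modal understanding, and cultural knowledge?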

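Key Findings

GPT-4 shows the strongest performance across languages but still struggles with low-resource languages and non-Latin scripts.
Most models score below 60% accuracy on the multilingual questions, with marked drops for non-Latin-script and low-resource languages.
Multimodal models perform poorly on complex multimodal questions; single-image models (e.g., BLIP-2) only slightly exceed text-only baselines.
The non-monotonic performance trend across educational levels suggests that LLM intelligence develops differently from human learning trajectories.
English prompting does not consistently improve results, whereas translating questions into English can substantially improve performance for some languages.
Few-shot demonstrations do not universally improve performance and help only for certain languages.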
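
To make the per-language and per-level comparisons behind these findings concrete, here is a small Python sketch of how accuracy could be grouped and how a monotonic trend across levels could be checked. It is an illustration only, not the benchmark's evaluation code: the record layout `(language, level, is_correct)`, the level labels, and both function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (language, level, is_correct)
Record = Tuple[str, str, bool]


def accuracy_by_group(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Compute accuracy for each (language, level) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    correct: Dict[Tuple[str, str], int] = defaultdict(int)
    for language, level, is_correct in records:
        totals[(language, level)] += 1
        correct[(language, level)] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


def is_monotonically_decreasing(values: List[float]) -> bool:
    """True if each value is less than or equal to the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))
```

Given accuracies ordered from the lowest to the highest educational level, `is_monotonically_decreasing` returning `False` would correspond to the kind of non-monotonic trend described above.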

This review was generated by AI and checked by a human editor.