[Paper Review] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro extends MMLU with 14 domains, 12,032 questions, ten answer options, and two rounds of expert review. The result is a harder, more robust, and more prompt-stable benchmark that separates model capabilities more clearly and rewards Chain-of-Thought reasoning.
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
Research Motivation and Goals
- Address saturation and noise in the original MMLU benchmark by creating a more challenging, reasoning-focused dataset.
- Increase answer options from four to ten to reduce guesswork and improve robustness.
- Eliminate trivial/noisy questions and introduce expert review to improve dataset quality.
- Assess a wide range of LLMs (open- and closed-source) on the new benchmark and analyze prompt sensitivity and reasoning requirements.
- Investigate how Chain-of-Thought (CoT) reasoning affects performance on MMLU-Pro compared to direct answering.
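The effect of moving from four to ten options on blind guessing can be made concrete with a short calculation. This is a minimal sketch: the function name `random_guess_accuracy` is illustrative and not from the paper, and it assumes exactly one correct option picked uniformly at random.

```python
def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing with one correct option."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


if __name__ == "__main__":
    # Four options (MMLU) versus ten options (MMLU-Pro).
    for k in (4, 10):
        print(f"{k} options: {random_guess_accuracy(k):.0%} chance accuracy")
```

Under this assumption, the chance floor falls from 25% to 10%, so a correct answer on a ten-option item is less likely to be a lucky guess.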
Proposed Method
- Construct MMLU-Pro by reorganizing 57 MMLU categories into 14 domains and filtering out easy questions using eight baseline models.
- Augment stem questions from multiple sources (STEM Website, TheoremQA, SciBench) with six additional distractors generated by GPT-4-Turbo to create ten-option MCQs.
- Apply a two-phase expert review (phase 1: correctness and appropriateness; phase 2: distractor validity via Gemini-1.5-Pro) to remove bad questions and false negatives.
- Use 5-shot Chain-of-Thought prompting to evaluate model reasoning on MMLU-Pro, with five discipline-representative demonstrations.
- Extract answers from model outputs via regex-based parsing of the specified

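Two steps from the method list above lend themselves to a short sketch: filtering out easy questions and parsing the final answer from a Chain-of-Thought response. The code below is illustrative only. The cutoff `max_correct=4` and the phrase pattern "answer is (X)" are assumptions, since this review states neither the filtering threshold nor the full output format, and the helper names are hypothetical.

```python
import re
from typing import List, Optional

# Assumed output format: the response ends with a phrase such as
# "The answer is (C)". Letters A-J cover the ten options.
ANSWER_PATTERN = re.compile(r"[Aa]nswer is \(?([A-J])\b\)?")


def keep_question(model_correct: List[bool], max_correct: int = 4) -> bool:
    """Keep a question unless more than `max_correct` baseline models solve it.

    `max_correct` is a hypothetical threshold chosen for illustration.
    """
    return sum(model_correct) <= max_correct


def extract_answer(response: str) -> Optional[str]:
    """Return the last option letter announced in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Eight baseline models, six of which answer correctly: treated as easy.
    print(keep_question([True] * 6 + [False] * 2))
    print(extract_answer("Let's think step by step. ... The answer is (C)."))
```

Taking the last match rather than the first is a design choice: a reasoning trace may mention a tentative answer before settling on a final one.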
Experimental Results
Research Questions
- RQ1: Does MMLU-Pro increase difficulty and discrimination among top-performing LLMs compared to MMLU?
- RQ2: How does model performance with Chain-of-Thought (CoT) compare to direct answering on MMLU-Pro versus MMLU?
- RQ3: How robust is MMLU-Pro to prompt variations across different styles (prompt sensitivity)?
- RQ4: What are the primary error sources for top models on MMLU-Pro (reasoning, domain knowledge, computation)?
Key Results
| Model | Overall | Math | Physics | Engineering | History | Law | Psychology |
|---|---|---|---|---|---|---|---|
| GPT-4o | 72.6 | 76.1 | 74.7 | 55.0 | 70.1 | 51.0 | 79.2 |
| GPT-4-Turbo | 63.7 | 62.8 | 61.0 | 35.9 | 67.7 | 51.2 | 78.3 |
| Claude-3-Opus | 68.5 | 69.6 | 69.7 | 48.4 | 61.4 | 53.5 | 76.3 |
| Gemini-1.5-Pro | 69.0 | 72.8 | 70.4 | 48.7 | 65.6 | 50.8 | 77.2 |
| Llama-3-70B-Instruct | 56.2 | 54.0 | 49.6 | 43.6 | 56.9 | 39.9 | 70.2 |
| Phi-3-medium-4k-instruct | 55.7 | 52.2 | 49.4 | 37.9 | 57.2 | 38.3 | 73.4 |
| DeepSeek-V2-Chat | 54.8 | 53.7 | 54.0 | 31.9 | 45.3 | 40.6 | 66.2 |
| Llama-3-70B | 52.8 | 49.7 | 49.8 | 35.0 | 57.7 | 35.0 | 71.4 |
| Qwen1.5-72B-Chat | 52.6 | 52.3 | 44.2 | 36.6 | 55.9 | 38.5 | 67.7 |
| Yi-1.5-34B-Chat | 52.3 | 56.2 | 49.4 | 34.4 | 52.8 | 34.8 | 64.3 |
- GPT-4o achieves 72.6% overall on MMLU-Pro, indicating substantial room for improvement.
- GPT-4o outperforms GPT-4-Turbo and Claude-3-Opus, and the gap widens against weaker models (e.g., about 9 points between GPT-4o and GPT-4-Turbo on MMLU-Pro: 72.6 vs. 63.7).
- MMLU-Pro better discriminates model differences than MMLU (e.g., GPT-4o vs GPT-4-Turbo gap grows from ~1–2% on MMLU to ~9% on MMLU-Pro).
- CoT reasoning yields larger gains on MMLU-Pro (GPT-4o rises by ~19% with CoT; on MMLU it rises ~1.5%); many models show similar trends, indicating deeper reasoning is needed for MMLU-Pro.
- Prompt sensitivity is lower on MMLU-Pro: across 24 prompt styles, scores vary by about 2% (vs. 4–5% on MMLU).
- Error analysis of GPT-4o identifies reasoning flaws (39%), lack of domain knowledge (35%), and calculation errors (12%) as primary failure modes.
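The discrimination claims can be checked directly against the results table. The sketch below copies a few rows from it and computes the overall gap between two models and each model's weakest listed subject. The dictionary layout and function names are illustrative, not part of the paper.

```python
SUBJECTS = ["Math", "Physics", "Engineering", "History", "Law", "Psychology"]

# (overall score, per-subject scores in SUBJECTS order), copied from the table.
SCORES = {
    "GPT-4o": (72.6, [76.1, 74.7, 55.0, 70.1, 51.0, 79.2]),
    "GPT-4-Turbo": (63.7, [62.8, 61.0, 35.9, 67.7, 51.2, 78.3]),
    "Claude-3-Opus": (68.5, [69.6, 69.7, 48.4, 61.4, 53.5, 76.3]),
    "Llama-3-70B-Instruct": (56.2, [54.0, 49.6, 43.6, 56.9, 39.9, 70.2]),
}


def overall_gap(model_a: str, model_b: str) -> float:
    """Difference in overall accuracy (percentage points), rounded to 0.1."""
    return round(SCORES[model_a][0] - SCORES[model_b][0], 1)


def weakest_subject(model: str) -> str:
    """Name of the listed subject on which the model scores lowest."""
    per_subject = SCORES[model][1]
    return SUBJECTS[per_subject.index(min(per_subject))]


if __name__ == "__main__":
    print(overall_gap("GPT-4o", "GPT-4-Turbo"))
    for name in SCORES:
        print(name, "->", weakest_subject(name))
```

Among the subjects shown, the lowest scores fall in Engineering or Law for every model in this subset, which is consistent with the reasoning-heavy character of the benchmark described above.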

This review was generated by AI and reviewed by a human editor.