QUICK REVIEW

[Paper Review] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui|arXiv (Cornell University)|Apr 13, 2023

Artificial Intelligence in Healthcare and Education | Cited by 61

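One-Sentence Summary

AGIEval is a bilingual benchmark built from human exams, comprising 8,062 questions across 20 tasks, for evaluating foundation models. Results show that GPT-4 performs strongly on some human-centric exams but still struggles with complex reasoning and domain-specific knowledge.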

ABSTRACT

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

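Motivation and Objectives

Use official exams to focus evaluation on tasks aligned with human cognition and decision-making.
Provide a robust, standardized, automatic measurement method based on objective question types.
Benchmark on both English and Chinese tasks to assess multilingual ability.
Release model outputs to promote transparency and reproducibility in language model evaluation.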

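Proposed Method

Curate questions from high-standard official exams (Gaokao, SAT, LSAT, GMAT, AMC/AIME, civil service exams, lawyer qualification exams).
Include only objective questions (multiple-choice and fill-in-the-blank) so that scoring is standardized.
Use accuracy as the metric for multiple-choice questions, and Exact Match/F1 for fill-in-the-blank questions.
Evaluate models in zero-shot and few-shot settings, each with and without Chain-of-Thought prompting.
Call Text-Davinci-003, ChatGPT, and GPT-4 through the Azure OpenAI Service API with fixed generation parameters (temperature 0, max tokens 2048).
Release all model outputs to support analysis and reproducibility.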
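
The metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names (`normalize`, `accuracy`, `exact_match`, `token_f1`) are hypothetical, the normalization (lowercasing and whitespace collapsing) is an assumption, and F1 is computed at the token level in the style commonly used for short-answer tasks, which may differ from the authors' exact procedure.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions that match the answer key."""
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def exact_match(prediction: str, answer: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(answer))


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])` scores three of four choices as correct, and `token_f1("x = 2", "x = 3")` gives partial credit for the two shared tokens where `exact_match` gives none.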

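Experimental Results

Research Questions

RQ1: How do state-of-the-art foundation models perform on human-level, real-world tasks drawn from official exams?
RQ2: On bilingual tasks, what are these models' strengths and limitations in understanding, knowledge, reasoning, and calculation?
RQ3: Do Chain-of-Thought prompting and few-shot settings improve performance on human-centric reasoning tasks?
RQ4: Across the exams, how does model performance compare with that of average and top human test takers?

Key Findings

In the zero-shot Chain-of-Thought setting, GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions.
GPT-4 attains 95% accuracy on the SAT Math test and 92.5% accuracy on the English test of the Chinese national college entrance exam (Gaokao).
Models struggle on tasks that require complex reasoning or specific domain knowledge (e.g., law, chemistry, physics).
A combined assessment of understanding, knowledge, reasoning, and calculation reveals distinct strengths and limitations for each model.
The benchmark offers directional insights for improving general capabilities on the path toward Artificial General Intelligence (AGI).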

This review was generated by AI and reviewed by human editors.