QUICK REVIEW

[Paper Review] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui | arXiv (Cornell University) | Apr 13, 2023

Artificial Intelligence in Healthcare and Education | Cited by 61

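One-line summary

AGIEval is a benchmark built from human exams, comprising 8,062 questions across 20 tasks, for evaluating foundation models. GPT-4 performs strongly on some human-centric tests but struggles with complex reasoning and domain-specific knowledge.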

ABSTRACT

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

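Motivation and objectives

Center evaluation on tasks from official exams that align with human cognition and decision-making.
Provide robust, standardized automatic metrics by using objective question formats.
Benchmark bilingual ability with tasks in English and Chinese.
Release model outputs to promote transparency and reproducibility in language model evaluation.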

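Proposed method

Collect questions from high-standard official exams (Gaokao, SAT, LSAT, GMAT, AMC/AIME, civil service exams, lawyer qualification exams, and others).
Include only objective items, namely multiple-choice and fill-in-the-blank questions, so that scoring can be standardized.
Use accuracy as the metric for multiple-choice questions and Exact Match/F1 for fill-in-the-blank questions.
Evaluate models in zero-shot and few-shot settings, each with and without Chain-of-Thought prompting.
Access Text-Davinci-003, ChatGPT, and GPT-4 through the Azure OpenAI Service API with fixed generation settings (temperature 0, max tokens 2048).
Release all model outputs to support analysis and reproducibility.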
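
The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the scoring code from the AGIEval repository: the function names (`normalize`, `accuracy`, `exact_match`, `token_f1`) are hypothetical, the normalization (lowercasing and whitespace collapsing only) is an assumption, and F1 is computed here as token-level overlap, which is one common definition and may differ from what the authors used.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, kept minimal
    so that signs and decimal points in math answers are preserved)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def exact_match(prediction: str, answer: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(answer))


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `accuracy(["A", "b", "C"], ["A", "B", "D"])` scores two of three choices as correct, while `token_f1("the red fox", "the fox")` gives partial credit where `exact_match` gives none. Restricting the benchmark to objective items is what makes this kind of fully automatic scoring possible.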

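Experimental results

Research questions

RQ1: How well do state-of-the-art foundation models perform on human-level, real-world tasks derived from official exams?
RQ2: What are the strengths and limitations of these models on bilingual tasks in terms of understanding, knowledge, reasoning, and calculation?
RQ3: Do Chain-of-Thought prompting and few-shot scenarios improve performance on human-centric reasoning tasks?
RQ4: How does model performance compare with that of average and top human test takers across the various exams?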

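Key findings

Under the zero-shot CoT setting, GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions.
GPT-4 achieves 95% accuracy on SAT Math and 92.5% accuracy on the English test of the Chinese Gaokao.
The models struggle on tasks that require complex reasoning or specific domain knowledge (e.g., law, chemistry, physics).
Evaluating understanding, knowledge, reasoning, and calculation together brings out each model's notable strengths and limitations.
The benchmark offers insights into directions for improving general capabilities on the path toward AGI.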


This review was generated by AI and checked by a human editor.