QUICK REVIEW

[Paper Review] Evaluating the Performance of Large Language Models on GAOKAO Benchmark

Xiaotian Zhang, Chunyang Li | arXiv (Cornell University) | May 21, 2023

Topic Modeling | Cited by 18

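One-Sentence Summary

This paper introduces GAOKAO-Bench, a benchmark built from Chinese GAOKAO (national college entrance examination) questions for evaluating LLMs. It analyzes zero-shot performance and agreement with human grading on objective and subjective questions, finding that models are stronger on objective questions and identifying areas that need improvement.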

ABSTRACT

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks; however, how to comprehensively and accurately assess their performance becomes an urgent issue to be addressed. This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples, including both subjective and objective questions. To align with human examination methods, we design a method based on zero-shot settings to evaluate the performance of LLMs. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. Our findings reveal that LLMs have achieved competitive scores in Chinese GAOKAO examination, while they exhibit significant performance disparities across various subjects. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores. In conclusion, this research contributes a robust evaluation benchmark for future large language models and offers valuable insights into the advantages and limitations of such models.

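Research Motivation and Goals

Motivate a domain-specific, human-aligned evaluation for Chinese educational tasks using GAOKAO questions.
Provide a benchmark covering GAOKAO data from 2010–2022 across all subjects to assess LLM capabilities.
Assess how well zero-shot prompting maps questions to model outputs.
Distinguish model performance on objective versus subjective questions and identify subject-specific strengths and weaknesses.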

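Proposed Method

Compile GAOKAO question data (2010–2022) into a JSON corpus, with mathematical formulas written in LaTeX.
Apply zero-shot prompts tailored to each question type to generate various outputs from the LLMs.
Score objective questions by exact match against the reference answers; score subjective questions through evaluation by human experts.
Validate the scoring by involving high school teachers, so that results stay consistent with a human baseline.
Analyze scoring rates by subject and question type to identify strengths (e.g., English) and weaknesses (e.g., physics, chemistry, Math_I).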
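
The objective-question part of this pipeline (type-specific zero-shot prompt, exact-match scoring, per-subject scoring rate) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the record fields (`subject`, `type`, `question`, `answer`, `score`), the prompt wording, the `Answer:` marker, the `Physics_MCQs` label, and the all-or-nothing credit rule are all assumptions made for the example, and the toy records are invented.

```python
import re
from collections import defaultdict

# Hypothetical zero-shot prompt templates keyed by question type.
# The wording is illustrative only; it is not taken from the paper.
PROMPTS = {
    "single_choice": "Choose exactly one option. End with 'Answer: <letter>'.\n",
    "multi_choice": "Choose all correct options. End with 'Answer: <letters>'.\n",
}

# Matches 'Answer: B' or 'Answer: A, C'; the trailing \b avoids reading
# the first letter of a word such as 'Because' as an option letter.
ANSWER_PATTERN = re.compile(r"Answer\s*[:：]\s*([A-D](?:\s*,?\s*[A-D])*)\b")


def build_prompt(record: dict) -> str:
    """Prepend the type-specific instruction to the question text."""
    return PROMPTS[record["type"]] + record["question"]


def extract_choices(model_output: str) -> list[str]:
    """Return the option letters from the last 'Answer:' marker, if any."""
    matches = ANSWER_PATTERN.findall(model_output)
    if not matches:
        return []
    return re.findall(r"[A-D]", matches[-1])


def score_objective(predicted: list[str], reference: list[str], full_score: float) -> float:
    """Exact match: full credit only if the letters equal the reference (assumed rule)."""
    return full_score if predicted == reference else 0.0


def scoring_rates(records: list[dict], outputs: list[str]) -> dict[str, float]:
    """Scoring rate per subject = points earned / points available."""
    earned = defaultdict(float)
    possible = defaultdict(float)
    for record, output in zip(records, outputs):
        subject = record["subject"]
        possible[subject] += record["score"]
        earned[subject] += score_objective(
            extract_choices(output), record["answer"], record["score"]
        )
    return {subject: earned[subject] / possible[subject] for subject in possible}


# Toy records and model outputs, invented for illustration.
RECORDS = [
    {"subject": "English_MCQs", "type": "single_choice",
     "question": "Q1 ...", "answer": ["B"], "score": 2},
    {"subject": "English_MCQs", "type": "single_choice",
     "question": "Q2 ...", "answer": ["C"], "score": 2},
    {"subject": "Physics_MCQs", "type": "multi_choice",
     "question": "Q3 ...", "answer": ["A", "C"], "score": 6},
]
OUTPUTS = [
    "The second option fits the context. Answer: B",
    "I think the last one is right. Answer: D",
    "Both hold by energy conservation. Answer: A, C",
]

if __name__ == "__main__":
    print(build_prompt(RECORDS[0]))
    print(scoring_rates(RECORDS, OUTPUTS))
```

Subjective questions are not covered by this sketch, since the summary states they were graded by human experts rather than by string matching.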

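Experimental Results

Research Questions

RQ1: How do large language models perform on GAOKAO questions in a zero-shot setting?
RQ2: How does LLM performance compare across subjects between objective and subjective questions?
RQ3: Which subjects or question types reveal the largest gaps between LLM performance and the human baseline?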

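Key Findings

Models perform best on objective questions, with high scoring rates on the English question types identified (e.g., English_Reading_Comp = 88.3%, English_MCQs = 78.1%, English_Fill_in_Blanks = 73.8%).
Scores on subjective questions are lower overall and vary by subject; physics, chemistry, biology, and mathematics show larger gaps because of their calculation and reasoning demands.
Overall, models are strong on knowledge-based questions but struggle with longer Chinese reading comprehension and some logical/mathematical reasoning tasks.
Subject-level analysis shows that English-related tasks are the strongest, while physics, chemistry, and Math_I pose significant challenges.
Human evaluation by teachers from Shanghai Caoyang No. 2 High School was used to align subjective scoring with a human baseline.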

This review was generated by AI and checked by a human editor.