QUICK REVIEW

[Paper Review] Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

Zhouhong Gu, Xiaoxuan Zhu|arXiv (Cornell University)|Jun 9, 2023

Topic Modeling | Cited by 9

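One-Sentence Summary

Xiezhi is a comprehensive, ever-updating benchmark that evaluates holistic domain knowledge across 516 disciplines, comprising 249,587 questions plus specialized subsets used to probe the cross-domain capabilities of 47 LLMs.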

ABSTRACT

New Natural Language Processing (NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines in 13 different subjects, accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate Xiezhi will help analyze important strengths and shortcomings of LLMs, and the benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.

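Research Motivation and Objectives

Motivate the need for new, broader benchmarks that can distinguish LLM capabilities across many disciplines.
Propose a large-scale, ever-updating domain knowledge benchmark aligned with the Chinese discipline taxonomy.
Create specialized subsets (Xiezhi-Specialty and Xiezhi-Interdiscipline) that capture domain-specific and cross-domain reasoning.
Design an evaluation setting with 50 options per multiple-choice question (MCQ), ranked by generation probability, to reveal true model capability.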
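
To make the motivation for 50 options concrete, the following minimal sketch compares chance-level baselines for 4-option and 50-option questions. It assumes a model that orders the options uniformly at random. The function names are illustrative, and the reciprocal-rank metric is an assumption made here for illustration, not something this review states.

```python
from fractions import Fraction


def random_guess_accuracy(num_options: int) -> Fraction:
    """Chance that a uniformly random ordering puts the one correct option first."""
    return Fraction(1, num_options)


def random_guess_reciprocal_rank(num_options: int) -> Fraction:
    """Expected 1/rank of the correct option under a uniformly random ordering.

    The correct option is equally likely to land at each rank 1..n,
    so the expectation is (1/n) * sum(1/k for k in 1..n).
    """
    harmonic = sum(Fraction(1, k) for k in range(1, num_options + 1))
    return harmonic / num_options


for n in (4, 50):
    acc = float(random_guess_accuracy(n))
    rr = float(random_guess_reciprocal_rank(n))
    print(f"{n} options: chance accuracy = {acc:.3f}, chance reciprocal rank = {rr:.3f}")
```

Under these assumptions, chance accuracy drops from 25% with 4 options to 2% with 50, and the chance-level reciprocal rank drops from roughly 0.52 to roughly 0.09, so a score well above the floor is much harder to reach by guessing.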

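Proposed Method

Compile 249,587 MCQs covering 516 disciplines drawn from 13 broad categories.
Manually annotate 20k questions from graduate entrance exams to form Xiezhi-Meta, with multi-label discipline annotation.
Automatically generate and label a further 107k questions from diverse exams and 80k questions from surveys, using an annotation model (a fine-tuned classifier) for the labeling.
Build Xiezhi-Specialty (questions tagged with three or fewer disciplines) and Xiezhi-Interdiscipline (four or more disciplines) for finer-grained evaluation.
Introduce 50-option MCQs and replace instruction-based answer selection with ranking by generation probability, reducing the effect of random guessing.
Evaluate 47 LLMs, both open-source models and API-based models (ChatGPT, GPT-4), in Chinese and English under 0-shot, 1-shot, and 3-shot settings.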
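
The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a scorer that returns the log-probability a model assigns to an option given the question, and it summarizes the rankings with mean reciprocal rank (MRR), a metric chosen here as an assumption. `LogProbFn`, `TOY_LOG_PROBS`, and `toy_log_prob` are hypothetical names; the toy table stands in for a real model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: log P(option | question) under some model.
LogProbFn = Callable[[str, str], float]


def rank_options(question: str, options: Sequence[str], log_prob: LogProbFn) -> List[int]:
    """Return option indices ordered from most to least probable."""
    scores = [log_prob(question, opt) for opt in options]
    return sorted(range(len(options)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank(ranking: Sequence[int], answer_index: int) -> float:
    """1 / (1-based position of the correct option in the ranking)."""
    return 1.0 / (list(ranking).index(answer_index) + 1)


def mean_reciprocal_rank(items: Sequence[Tuple[str, Sequence[str], int]],
                         log_prob: LogProbFn) -> float:
    """Average reciprocal rank over (question, options, answer_index) triples."""
    rrs = [reciprocal_rank(rank_options(q, opts, log_prob), ans) for q, opts, ans in items]
    return sum(rrs) / len(rrs)


# Toy stand-in for a model: a fixed table of made-up log-probabilities.
TOY_LOG_PROBS = {
    ("q1", "alpha"): -2.0, ("q1", "beta"): -0.5, ("q1", "gamma"): -3.0,
    ("q2", "delta"): -0.3, ("q2", "epsilon"): -1.2,
}


def toy_log_prob(question: str, option: str) -> float:
    return TOY_LOG_PROBS[(question, option)]


items = [
    ("q1", ["alpha", "beta", "gamma"], 1),  # correct option is ranked first
    ("q2", ["delta", "epsilon"], 1),        # correct option is ranked second
]
print(mean_reciprocal_rank(items, toy_log_prob))  # (1 + 1/2) / 2 = 0.75
```

Because the model never has to emit an option letter, this style of scoring also works for models that do not follow instructions reliably, which is the stated reason for replacing instruction-based selection.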

Figure 1: In Chinese mythology, the Xiezhi is a legendary creature known for its ability to discern right from wrong and uphold justice. Xiezhi Benchmark encompasses 13 distinct disciplinary categories, 118 sub-disciplines, and 385 further fine-grained disciplines, aiming to provide an extensive dom

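Experimental Results

Research Questions

RQ1: How good are the coverage, timeliness, and annotation quality of a holistic domain knowledge benchmark spanning 516 disciplines?
RQ2: How do modern LLMs perform across disciplines when evaluated with 50-option MCQs and generation-probability ranking?
RQ3: Do the specialized datasets (Xiezhi-Specialty and Xiezhi-Interdiscipline) expose LLM strengths or limitations that differ from those seen on the full benchmark?
RQ4: How do pre-training and fine-tuning affect domain knowledge performance, and how do model size and data balance influence the results?
RQ5: Can Xiezhi distinguish fine-grained capability differences among LLMs, from GPT-4 down to smaller-parameter models?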

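Key Findings

Top open-source LLMs fine-tuned on domain data outperform the average human in science, engineering, agronomy, and medicine, but lag behind in economics, jurisprudence, pedagogy, literature, history, and management.
GPT-4 and ChatGPT show strong gains from few-shot demonstrations, whereas many smaller LLMs do not benefit consistently from demonstrations.
Model size alone does not guarantee better performance; the chosen architecture and the balance of the training data determine the outcome.
Specialized fine-tuning on medical data yields high performance in the medical domain, but may come at the cost of general-domain understanding.
Among the benchmarks compared, Xiezhi shows the highest variance in model performance, indicating strong power to discriminate between LLMs of differing capability.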
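
The variance argument in the last finding can be illustrated with a small sketch. It assumes each benchmark yields one score per model on a comparable scale; the function name and the benchmark and model labels are hypothetical, and the numbers are placeholders for illustration only, not results from the paper.

```python
from statistics import pvariance
from typing import Dict, Mapping


def score_variance_by_benchmark(scores: Mapping[str, Mapping[str, float]]) -> Dict[str, float]:
    """Population variance of model scores, computed separately per benchmark."""
    return {bench: pvariance(list(per_model.values())) for bench, per_model in scores.items()}


# Placeholder scores for illustration only; NOT taken from the paper.
toy_scores = {
    "benchmark_a": {"model_1": 0.30, "model_2": 0.32, "model_3": 0.31},
    "benchmark_b": {"model_1": 0.10, "model_2": 0.45, "model_3": 0.80},
}

variances = score_variance_by_benchmark(toy_scores)
most_discriminative = max(variances, key=variances.get)
print(variances)
print("largest spread across models:", most_discriminative)
```

A benchmark on which all models score about the same (like `benchmark_a` here) says little about their relative ability, while a wide spread separates them. Comparing raw variances is only meaningful when the benchmarks report scores on the same scale.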

Figure 2: The figure on the right is the statistics of all questions collected by Xiezhi. The middle figure shows statistics for Xiezhi-Specialty and the left shows Xiezhi-Interdiscipline.

This review was generated by AI and checked by a human editor.