QUICK REVIEW

[Paper Review] KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Jifan Yu, Xiaozhi Wang|arXiv (Cornell University)|Jun 15, 2023

Topic Modeling | Cited by 24

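One-sentence summary

KoLA builds a knowledge-focused benchmark around a four-level cognitive taxonomy, a mix of known and evolving data, a contrastive scoring system, and a self-contrast metric, and uses it to evaluate 28 LLMs on 19 tasks. It is updated quarterly to track progress.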

ABSTRACT

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models and a unique self-contrast metric for automatically evaluating knowledge-creating ability. We evaluate 28 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.

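Motivation and objectives

Organize world-knowledge abilities with a four-level cognitive taxonomy inspired by Bloom's taxonomy: knowledge memorization, understanding, applying, and creating.
Evaluate fairly both what LLMs have memorized and how well they adapt to new knowledge by combining known data (a Wikipedia subset) with evolving data (recently published articles).
Provide a contrastive evaluation framework with standardized scores for comparability across tasks, plus a self-contrast metric for assessing knowledge creation.
Run KoLA in quarterly seasons to track development and provide actionable diagnostics for improving LLM knowledge systems.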

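Proposed method

Adopt a four-level cognitive taxonomy of knowledge abilities (KM, KU, KA, KC) to organize 19 tasks covering memorizing, understanding, applying, and creating knowledge.
Use two data sources: known data from Wikipedia/Wikidata5M and evolving data from recently published articles, to test both memorization and the ability to handle updated knowledge.
Implement a contrastive evaluation system with standardized scores across tasks for comparability across models, and a self-contrast metric for evaluating knowledge creation.
Design an automatic evaluation based on Rouge-L similarity that contrasts model outputs generated with and without the foreknowledge K, and combines the comparisons into a mixed KC score.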
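
The self-contrast idea in the last item can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes whitespace tokenization, a Rouge-L F1 score computed from the longest common subsequence, and a mixed KC score taken as the unweighted mean of three pairwise similarities (free output vs. reference, free output vs. knowledge-conditioned output, knowledge-conditioned output vs. reference). The function names and the averaging rule are hypothetical and are not stated in this summary.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L F1 over lowercased whitespace tokens (tokenization is an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def self_contrast_score(free_output: str, knowledge_output: str, reference: str) -> float:
    """Hypothetical mixed KC score: mean of three pairwise Rouge-L similarities.

    free_output      -- completion generated from the context alone
    knowledge_output -- completion generated with the foreknowledge K supplied
    reference        -- the human-written continuation
    """
    pairs = [
        (free_output, reference),
        (free_output, knowledge_output),
        (knowledge_output, reference),
    ]
    return sum(rouge_l_f1(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    score = self_contrast_score(
        free_output="the company opened a new plant in march",
        knowledge_output="the company opened a new plant in april",
        reference="the company opened its new plant in april",
    )
    print(f"mixed KC score: {score:.3f}")
```

Under these assumptions, a model whose free output already agrees with its knowledge-conditioned output scores higher on the middle term, which is the part of the comparison that needs no human reference.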

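Experimental results

Research questions

RQ1: How do LLMs differ in memorizing, understanding, applying, and creating world knowledge?
RQ2: How do model size and alignment affect the different knowledge abilities on known versus evolving data?
RQ3: Can standardized cross-task scores provide a fair, interpretable leaderboard across diverse LLMs?
RQ4: Can the self-contrast metric effectively evaluate knowledge creation and reduce the influence of hallucination?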

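Key findings

Among models without alignment, larger base models tend to memorize more knowledge, with a pronounced scaling effect on KM.
Alignment and instruction tuning improve higher-level abilities (KA, KC) but can reduce raw memorization (KM), revealing an alignment tax on low-level memorization.
Commercial models generally outperform open-source models on the standardized KoLA scores.
After instruction tuning, model size correlates more strongly with higher-level abilities, while the gain in KM from scale is less pronounced.
KoLA's evolving-data seasons allow a fairer evaluation of unseen knowledge and track model development over time.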
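
The comparisons above rest on standardized scores rather than raw task metrics. A minimal Python sketch of one way to compute such scores follows. It assumes each task's raw results are converted to z-scores across the evaluated models, and that all z-scores are then rescaled to a 0 to 100 range with min-max scaling. The rescaling step, the function names, and the toy numbers are assumptions made for illustration and are not taken from this summary.

```python
from statistics import mean, pstdev
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # task name -> model name -> score


def standardize_task(raw: Dict[str, float]) -> Dict[str, float]:
    """Z-score one task's raw results across models."""
    values = list(raw.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All models tie on this task, so none is above or below average.
        return {model: 0.0 for model in raw}
    return {model: (x - mu) / sigma for model, x in raw.items()}


def standardized_scores(results: Scores) -> Scores:
    """Z-score each task, then min-max rescale all z-scores to [0, 100].

    The 0-100 rescaling is an assumption of this sketch.
    """
    z = {task: standardize_task(raw) for task, raw in results.items()}
    all_z = [v for per_task in z.values() for v in per_task.values()]
    lo, hi = min(all_z), max(all_z)
    span = hi - lo
    if span == 0:
        return {t: {m: 50.0 for m in per_task} for t, per_task in z.items()}
    return {
        t: {m: 100.0 * (v - lo) / span for m, v in per_task.items()}
        for t, per_task in z.items()
    }


if __name__ == "__main__":
    # Toy numbers on deliberately different scales (accuracy in % vs. F1 in [0, 1]).
    toy = {
        "task_a": {"model_1": 10.0, "model_2": 20.0, "model_3": 30.0},
        "task_b": {"model_1": 0.1, "model_2": 0.2, "model_3": 0.3},
    }
    for task, per_model in standardized_scores(toy).items():
        print(task, {m: round(s, 1) for m, s in per_model.items()})
```

In the toy input the two tasks use different raw scales, yet after standardization the same model ordering maps to approximately the same scores on both, which is what makes averaging across tasks meaningful.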

This review was generated by AI and reviewed by human editors.