QUICK REVIEW

[Paper Review] SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

Liang Xu, Anqi Li|arXiv (Cornell University)|Jul 27, 2023

Topic Modeling | Cited by 22

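One-sentence summary

SuperCLUE introduces a Chinese LLM benchmark with three components (CArena, OPEN, and CLOSE) designed to reflect real user preferences. The results show that open-ended and closed-ended questions are both necessary, and GPT-4 serves as the automatic judge on OPEN questions. Eleven models are evaluated, revealing a large gap between Chinese LLMs and GPT-4.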

ABSTRACT

Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones. At the same time, they can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge to automatically evaluate human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com

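Motivation and objectives

Motivation: measure LLM capabilities in real-world, user-centered Chinese scenarios, going beyond evaluations that focus only on closed-ended accuracy.
Develop a multi-component benchmark (CArena, OPEN, CLOSE) that captures open-ended dialogue and instruction-following ability.
Analyze how open-ended and closed-ended evaluations relate to each other and to real user preferences.
Demonstrate the feasibility of using GPT-4 as an automatic judge of open-ended responses in Chinese.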

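Proposed method

Build CArena on top of the LangYa Leaderboard, using user-reported wins and ties as the gold standard.
Define OPEN as 600 open-ended questions (30 single-turn plus 30 multi-turn in each of 10 capability categories).
Convert the OPEN SINGLE question stems into four-option multiple-choice questions with GPT-3.5, then verify them manually.
Evaluate eight Chinese-oriented LLMs and three globally available models in a zero-shot setting.
Use GPT-4 as the automatic judge that compares model responses in the open-ended evaluation.
Analyze the correlations among CLOSE, OPEN, and CArena results to understand their complementary value.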
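
The composition of the OPEN set and the win/tie signal mentioned above can be illustrated with a short sketch. It assumes a model's score is the share of pairwise comparisons it won or tied; the summary does not give the exact formula, so the function `win_tie_rate`, the label names, and the sample verdicts are all hypothetical.

```python
from collections import Counter

# Size of the OPEN set as described above: 10 capability categories,
# each with 30 single-turn and 30 multi-turn questions.
NUM_CATEGORIES = 10
SINGLE_PER_CATEGORY = 30
MULTI_PER_CATEGORY = 30
OPEN_TOTAL = NUM_CATEGORIES * (SINGLE_PER_CATEGORY + MULTI_PER_CATEGORY)

VALID_LABELS = {"win", "tie", "loss"}  # hypothetical label names for this sketch


def win_tie_rate(verdicts):
    """Return the share of comparisons the model won or tied.

    `verdicts` holds one label per pairwise comparison, written from the
    evaluated model's point of view. This scoring rule is an assumption,
    not a formula taken from the paper.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Made-up verdicts for illustration only; not data from the paper.
example_verdicts = ["win", "tie", "loss", "win", "loss"]
example_rate = win_tie_rate(example_verdicts)
print(f"OPEN questions: {OPEN_TOTAL}, win-or-tie rate: {example_rate:.2f}")
```

The same function could be applied to judge verdicts or to user votes, as long as both are mapped to the same three labels first.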

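Experimental results

Research questions

RQ1: To what extent do the open-ended (OPEN) and closed-ended (CLOSE) formats reflect real user preferences in Chinese LLM interactions?
RQ2: Can GPT-4 serve as a reliable automatic judge of open-ended responses in Chinese, and how well do its judgments agree with human raters?
RQ3: How are CArena user ratings, OPEN performance, and CLOSE accuracy related across Chinese LLMs?
RQ4: Does combining OPEN and CLOSE evaluations predict real-world user preferences better than either format alone?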

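Key findings

GPT-4 outperforms all other models on both OPEN and CLOSE, and both sets of results show a significant gap between Chinese LLMs and GPT-4.
MiniMax ranks first among the Chinese LLMs tested and is complementary to ChatGLM2-6B in several capability areas.
GPT-4 agrees closely with human raters on the OPEN evaluation (Pearson correlation of about 0.80).
CLOSE accuracy alone reflects user preferences poorly in OPEN-like interactive scenarios; OPEN and CLOSE complement each other in predicting CArena results.
OPEN MULTIPLE (multi-turn) correlates more strongly with CArena preferences than OPEN SINGLE, suggesting that multi-turn context captures user preferences better.
Across models, CLOSE results mostly cluster around 55-60% accuracy while OPEN results vary widely, highlighting the limits of closed-ended metrics for gauging real-world capability.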
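
Several of the findings above rest on Pearson correlation, for example the agreement between GPT-4 and human raters. The following is a minimal sketch of how such a coefficient is computed. The score lists are invented for illustration and are not the paper's data, and the function name `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, plus each sequence's sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores; NOT taken from the paper.
judge_scores = [1, 2, 3, 4, 5]   # e.g. scores assigned by an automatic judge
human_scores = [1, 2, 3, 5, 4]   # e.g. scores assigned by human raters
r = pearson(judge_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```

The same function could be used to compare CLOSE accuracy, OPEN scores, and CArena ratings pairwise, one value per model. With only eleven models, any such coefficient would be a rough indicator rather than a precise estimate.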

This review was generated by AI and reviewed by human editors.