QUICK REVIEW

[Paper Digest] SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu|arXiv (Cornell University)|Jul 20, 2023

Online Learning and Analytics | Cited by 16

One-Sentence Summary

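SciBench provides a college-level, open-ended benchmark for evaluating large language models on multi-step scientific problem solving across mathematics, chemistry, and physics, including multimodal and closed-book exam subsets. Results show that current LLMs still struggle: the best average score is about 43% on the text data and 13.8% on the multimodal subset, leaving substantial room for improvement.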

ABSTRACT

Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.

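Research Motivation and Goals

Push benchmarks beyond high-school material toward college-level scientific reasoning tasks.
Provide a large, curated dataset of open-ended, multi-step problems covering mathematics, chemistry, and physics.
Include a multimodal subset with visual elements to evaluate multimodal LLMs.
Provide a closed-book exam dataset to simulate real-world assessment and reduce training-data leakage.
Propose a self-refinement method to diagnose skill gaps and analyze how prompting strategies affect problem-solving abilities.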

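Proposed Method

Compile SciBench, comprising 789 textbook-level problems from college physics, chemistry, and mathematics and 94 multimodal problems, plus 103 closed-book exam problems from CS and mathematics courses.
Provide detailed solutions in LaTeX, with answers in two formats (numerical values and LaTeX expressions), along with units.
Evaluate unimodal and multimodal LLMs, both open-source and proprietary, in text and multimodal settings under zero-shot, few-shot, chain-of-thought (CoT), and external-tool (Python/Wolfram) prompting.
Use a temperature of zero and a 5% relative tolerance for numerical answers; manually align and verify the tool prompts and solutions.
Develop an LLM-backed self-critique verifier that maps incorrect solutions to ten identified scientific problem-solving skills for error analysis.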
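
The 5% relative tolerance used for grading numerical answers can be illustrated with a minimal sketch. The function names `is_correct` and `accuracy`, and the handling of a zero reference value, are assumptions made for illustration; the digest does not describe the benchmark's actual grading code.

```python
def is_correct(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Return True if `predicted` lies within `rel_tol` of `reference`.

    The tolerance is measured relative to the reference answer.
    """
    if reference == 0:
        # Assumption: with a zero reference, relative error is undefined,
        # so fall back to an absolute tolerance of the same size.
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) <= rel_tol * abs(reference)


def accuracy(predictions, references, rel_tol: float = 0.05) -> float:
    """Fraction of predictions graded correct under the relative tolerance."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r, rel_tol) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this sketch, a prediction of 1.04 against a reference of 1.0 counts as correct, while 1.06 does not. Units are not handled here; a fuller grader would need to compare or normalize them before the numeric check.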

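Experimental Results

Research Questions

RQ1: Can SciBench distinguish the performance of different LLMs on college-level multi-step problems?
RQ2: How do prompting strategies (CoT, few-shot) and external tools affect different problem-solving skills?
RQ3: What are the common failure modes and skill deficiencies of LLMs on college-level scientific problems?
RQ4: Does multimodal context (visual/figure data) significantly affect LLM performance and tool use?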

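Key Findings

Among the evaluated models, the best text-only score is 43.22%, achieved with the strongest configuration (few-shot CoT with Python tool use).
Performance on the multimodal subset is markedly lower, averaging about 13.8% under the strongest configuration of the GPT-4 variants.
On the closed-book exam dataset, the strongest configuration averages 51.57%, still below human performance.
CoT improves calculation skills but, in the zero-shot setting, can weaken other skills such as causal reasoning and logical decomposition.
External tools can reduce calculation errors but may weaken other skills such as code conversion; few-shot learning does not universally improve scientific problem solving.
SciBench effectively differentiates models by capability and prompting strategy, underscoring that no single prompting strategy universally outperforms the others.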


This digest was generated by AI and reviewed by a human editor.