QUICK REVIEW

[Paper Review] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen, Shijie Chen | arXiv (Cornell University) | Oct 7, 2024

Topic Modeling | Cited by 6

One-sentence summary

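ScienceAgentBench is a rigorously validated benchmark of 102 tasks across four disciplines. It evaluates how well language agents can generate executable Python programs for data-driven scientific tasks, and it shows that current agents have limited capability for end-to-end automation.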

ABSTRACT

The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using ScienceAgentBench, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1-preview with direct prompting and self-debug, which can boost the performance to 42.2%, demonstrating the effectiveness of increasing inference-time compute but with more than 10 times the cost of other LLMs. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

Research motivation and goals

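Evaluate how well language agents generate self-contained Python programs for data-driven discovery tasks.
Ensure scientific authenticity by grounding tasks in peer-reviewed publications and expert validation.
Provide a robust, multi-faceted evaluation covering execution, output quality, and cost metrics.
Mitigate data contamination and shortcut strategies so that the evaluation is fair and relevant to real-world use.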

Proposed method

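Curate 102 tasks from 44 peer-reviewed publications in four disciplines, validated by nine subject matter experts.
Unify the output of every task as a self-contained Python program, evaluated with execution, output-quality, and cost metrics.
Set up a conda-based execution environment for each program, using pipreqs and pip-tools to resolve package dependencies so that programs are executed fairly.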
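Apply two data-contamination mitigation strategies: randomly removing data points from test sets and, where needed, re-splitting labeled data and replacing test labels with placeholder values.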
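Use a two-stage evaluation: automatic metrics (valid execution rate, VER; success rate, SR; CodeBERTScore, CBS; and Cost) followed by rubric-based human evaluation for a finer-grained assessment.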
Compare five open-weight and proprietary LLMs under three frameworks (Direct Prompting, OpenHands CodeAct, Self-Debug), with three attempts per task.
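
As a rough illustration of how three attempts per task could be rolled up into benchmark-level numbers, the sketch below keeps one run per task and averages the four automatic metrics. The `Run` fields, the tie-breaking order in `best_run` (success first, then valid execution, then CBS, then lower cost), and all function names are assumptions made for this example; they are not taken from the benchmark's released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Outcome of one attempt at one task (field names are illustrative)."""
    success: bool    # counts toward SR: output met the task's success criteria
    executed: bool   # counts toward VER: program ran without errors
    cbs: float       # similarity to the reference program, in [0, 1]
    cost: float      # API cost of the attempt, in USD


def best_run(runs):
    # Assumed selection rule: prefer success, then valid execution,
    # then higher CBS, then lower cost.
    return max(runs, key=lambda r: (r.success, r.executed, r.cbs, -r.cost))


def aggregate(task_runs):
    """Average the metrics of the best run per task.

    task_runs maps a task id to the list of attempts made on that task.
    """
    best = [best_run(runs) for runs in task_runs.values()]
    n = len(best)
    return {
        "SR": 100 * sum(r.success for r in best) / n,
        "VER": 100 * sum(r.executed for r in best) / n,
        "CBS": 100 * sum(r.cbs for r in best) / n,
        "Cost": sum(r.cost for r in best) / n,
    }
```

Calling `aggregate` on a dictionary of per-task attempt lists returns SR, VER, and CBS as percentages and Cost as an average per task, which is the shape in which the findings below quote results.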

Experimental results

Research questions

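RQ1: What success rate can current language agents achieve on a diverse collection of real-world data-driven discovery tasks?
RQ2: How do different agent frameworks and LLMs affect executable code generation, task success rate, and cost?
RQ3: Does providing expert knowledge improve agent performance, and under what conditions?
RQ4: What role does execution feedback (self-debug) play in improving the quality of generated programs?
RQ5: Do the data-contamination mitigation strategies effectively prevent shortcut solutions in benchmark evaluation?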
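
To make RQ5 concrete, here is a minimal sketch of the two mitigation strategies named under the proposed method: dropping random test points and replacing test labels with a placeholder. The function names, the `n_remove` and `placeholder` parameters, and the list-of-dicts data layout are all hypothetical choices for this example. The presumed rationale is that a program reproducing memorized outputs no longer matches the modified test set, and a program that reads test labels as a shortcut sees only placeholder values.

```python
import random


def drop_test_points(test_rows, n_remove, seed=0):
    """Return a copy of the test set with n_remove randomly chosen rows removed."""
    rng = random.Random(seed)  # seeded so the modified test set is reproducible
    dropped = set(rng.sample(range(len(test_rows)), n_remove))
    return [row for i, row in enumerate(test_rows) if i not in dropped]


def mask_labels(test_rows, label_key, placeholder=-1):
    """Return a copy of the test set whose labels are replaced by a placeholder."""
    return [{**row, label_key: placeholder} for row in test_rows]


# Toy data: ten rows with a feature "x" and a binary label "y".
rows = [{"x": i, "y": i % 2} for i in range(10)]
public_test = mask_labels(drop_test_points(rows, n_remove=3), label_key="y")
```

Both helpers return new lists and leave the original rows untouched, so the true labels stay available to the evaluator.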

Main findings

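With expert-provided knowledge, the best-performing agent reaches an SR of 34.3%, which highlights how limited current agents are at solving data-driven discovery tasks.
Without expert knowledge, Claude-3.5-Sonnet with self-debug reaches an SR of 32.4%, a modest gain from execution-based debugging.
Direct prompting performs much worse than self-debug for most models; self-debug more than doubles the SR of several agents.
Cost remains a key factor across frameworks and models, and some low-cost configurations are competitive with more expensive options.
Expert-provided knowledge can raise SR and CBS but may lower VER, because it points agents to unfamiliar APIs or highly customized tools; guidance therefore carries nuanced trade-offs.
Overall, agents struggle with complex, heterogeneous scientific tasks, especially those involving domain-specific tools, which indicates that end-to-end automation is not yet achievable.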
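
For a sense of scale, the reported percentages can be converted back into task counts using the benchmark size of 102. The helper name `tasks_solved` is invented for this example, and the counts are inferred by rounding rather than quoted from the paper.

```python
TOTAL_TASKS = 102  # benchmark size stated in the abstract


def tasks_solved(success_rate_pct, total=TOTAL_TASKS):
    """Nearest whole number of tasks consistent with a reported success rate."""
    return round(success_rate_pct / 100 * total)


reported = [
    ("best agent, no expert knowledge", 32.4),
    ("best agent, with expert knowledge", 34.3),
    ("o1-preview", 42.2),
]
for label, sr in reported:
    n = tasks_solved(sr)
    print(f"{label}: {sr}% is about {n} of {TOTAL_TASKS} tasks")
```

Under this rounding, the three success rates correspond to roughly 33, 35, and 43 tasks, so even the strongest configuration leaves more than half of the benchmark unsolved.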

This review was generated by AI and reviewed by a human editor.