QUICK REVIEW

[Paper Review] Are large language models superhuman chemists?

Adrian Mirza, Nawaf Alampara|arXiv (Cornell University)|Apr 1, 2024

History and advancements in chemistry | Cited by 17

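One-sentence summary

The paper introduces ChemBench, a benchmark of 2,788 chemistry question-answer pairs for evaluating LLMs; the leading models outperform the best human chemists in the study on average, but still struggle with some tasks and with confidence calibration.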

ABSTRACT

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here, we introduce "ChemBench," an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.

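Motivation and objectives

Create a standardized benchmark (ChemBench) for evaluating chemical knowledge, reasoning, and intuition in LLMs beyond property prediction.
Compare state-of-the-art LLMs with expert chemists on a broad, education-aligned chemistry corpus.
Analyze model performance across chemistry subfields and question types to identify strengths and gaps.
Provide open, extensible evaluation infrastructure and a leaderboard to track future progress and promote safer, more useful chemistry AI systems.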

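Proposed method

Curate 2,788 question-answer pairs covering undergraduate and graduate chemistry topics from manual and semi-automated sources.
Encode chemistry-specific modalities (for example, SMILES) with annotated tags to support tool-augmented systems.
Evaluate a broad range of open- and closed-source LLMs with strict correct/incorrect scoring, including tool-assisted settings.
Survey human experts on a subset of the questions to establish a baseline and examine agreement between experts and models.
Implement parsing and prompting pipelines that extract the final answer from each text completion, including handling of domain-specific formats such as SMILES and equations.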
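
The tagging, parsing, and strict scoring steps listed above can be sketched roughly as follows. This is a minimal illustration, not ChemBench's actual implementation: the `[START_SMILES]`/`[END_SMILES]` tags, the `[ANSWER]...[/ANSWER]` wrapper, and every function name here are assumptions made for the example, since the summary does not specify them.

```python
import re
from typing import Optional, Set

# Hypothetical delimiters; the summary does not state the real tag names.
SMILES_OPEN, SMILES_CLOSE = "[START_SMILES]", "[END_SMILES]"
ANSWER_RE = re.compile(r"\[ANSWER\](.*?)\[/ANSWER\]", re.DOTALL)


def tag_smiles(template: str, smiles: str) -> str:
    """Fill the {smiles} slot of a prompt template with a tagged SMILES string."""
    return template.format(smiles=f"{SMILES_OPEN}{smiles}{SMILES_CLOSE}")


def parse_answer(completion: str) -> Optional[str]:
    """Return the content of the last [ANSWER]...[/ANSWER] span, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def score_mcq(completion: str, correct: Set[str]) -> bool:
    """Strict scoring: the chosen options must match the correct set exactly."""
    parsed = parse_answer(completion)
    if parsed is None:
        return False  # an unparseable completion counts as wrong
    chosen = {part.strip().upper() for part in parsed.split(",") if part.strip()}
    return chosen == correct


def score_numeric(completion: str, target: float, rel_tol: float = 0.01) -> bool:
    """Strict scoring for numeric answers within an assumed relative tolerance."""
    parsed = parse_answer(completion)
    if parsed is None:
        return False
    try:
        value = float(parsed)
    except ValueError:
        return False
    return abs(value - target) <= rel_tol * abs(target)
```

For instance, `tag_smiles("How many NMR signals does {smiles} show?", "CCO")` wraps the molecule in tags so that a tool-augmented system could locate it, and `score_mcq` gives no partial credit when only some of the correct options are selected.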

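Experimental results

Research questions

RQ1: How do state-of-the-art LLMs perform on a broad chemistry benchmark compared with expert chemists?
RQ2: What are the strengths and limitations of LLMs across chemistry topics and question types?
RQ3: To what extent do model scale, tool augmentation, and domain-specific encoding affect performance on chemical reasoning tasks?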

Key findings

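On ChemBench overall, the best models outperform the human chemists in the study, including, on average, the best of them (the top model scores roughly twice the human level).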
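Open-source models (for example, Llama-3.1-405B-Instruct) come close to the leading proprietary models on several tasks.
Performance gaps remain on knowledge-intensive questions and in specific subfields such as toxicity/safety and analytical chemistry (for example, counting NMR signals remains difficult).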
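Model performance does not vary consistently with molecular complexity, which suggests that models rely more on proximity to their training data than on structural reasoning.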
Models often give overconfident or poorly calibrated uncertainty estimates, raising safety and reliability concerns for real-world use.
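
To make the calibration finding concrete, one common way to quantify overconfidence is the expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy inside confidence bins. The sketch below is an illustration only: the summary does not say which calibration measure the paper uses, and the function name, bin count, and inputs are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins.

    `confidences` are assumed to be self-reported values in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only one right has a large gap between confidence and accuracy in that bin, which is the pattern the finding describes as overconfidence.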


This review was generated by AI and reviewed by a human editor.