QUICK REVIEW

[Paper Review] What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Taicheng Guo, Kehan Guo|arXiv (Cornell University)|May 27, 2023

Machine Learning in Materials Science | Cited by 91

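ONE-SENTENCE SUMMARY

The paper benchmarks five large language models (GPT-4, GPT-3.5, Davinci-003, Llama, Galactica) on eight chemistry tasks to assess their understanding, reasoning, and explaining capabilities; GPT-4 generally leads, but it falls short on SMILES-heavy generation tasks.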

ABSTRACT

Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged and have been applied in various kinds of areas such as science, finance and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate capabilities of LLMs in a wide range of tasks across the chemistry domain. We identify three key chemistry-related capabilities including understanding, reasoning and explaining to explore in LLMs and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama and Galactica) are evaluated for each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed other models and LLMs exhibit different competitive levels in eight chemistry tasks. In addition to the key findings from the comprehensive benchmark analysis, our work provides insights into the limitation of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.

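MOTIVATION AND OBJECTIVES

Evaluate the understanding, reasoning, and explaining capabilities of large language models on eight practical chemistry tasks.
Assess zero-shot and few-shot in-context learning with five well-known LLMs.
Identify strengths, weaknesses, and task-dependent performance relative to domain-specific baselines.
Give researchers and chemists actionable guidance on using LLMs and prompting strategies in chemistry.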

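PROPOSED METHOD

Benchmark eight chemistry tasks using datasets including PubChem, BBBP, Tox21, HIV, BACE, USPTO, and ChEBI.
Evaluate five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, Galactica) under zero-shot and few-shot prompting.
Design task-specific in-context learning (ICL) prompts that combine demonstrations with a four-part template to reduce hallucination.
Study how the ICL retrieval strategy (random vs. scaffold-based) and the number of demonstrations k affect performance.
Repeat each evaluation five times to account for model randomness, and report the mean and variance.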
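The prompt assembly and repeated-evaluation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names of the four template parts (role, task instruction, demonstrations, question), the prompt wording, the helper names `build_prompt` and `evaluate`, the toy SMILES pool with made-up labels, and the `always_yes` stand-in for an LLM call are all hypothetical. Demonstrations are drawn at random here, and variance is computed as the sample variance.

```python
import random
import statistics
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (SMILES string, answer label)

# Placeholder wording; the paper's actual prompt text is not reproduced here.
ROLE = "You are an expert chemist."
TASK = "Given a molecule's SMILES string, answer Yes or No."


def build_prompt(demos: List[Example], query: str) -> str:
    """Join four parts: role, task instruction, demonstrations, question."""
    demo_text = "\n".join(f"SMILES: {s}\nAnswer: {a}" for s, a in demos)
    question = f"SMILES: {query}\nAnswer:"
    # With no demonstrations (zero-shot), the empty part is dropped.
    return "\n\n".join(p for p in (ROLE, TASK, demo_text, question) if p)


def evaluate(
    model: Callable[[str], str],
    pool: List[Example],
    test_set: List[Example],
    k: int,
    runs: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return mean and sample variance of accuracy over `runs` repetitions."""
    scores = []
    for run in range(runs):
        rng = random.Random(seed + run)  # a fresh demonstration draw per run
        correct = 0
        for query, gold in test_set:
            demos = rng.sample(pool, min(k, len(pool)))
            answer = model(build_prompt(demos, query)).strip()
            correct += int(answer == gold)
        scores.append(correct / len(test_set))
    return statistics.mean(scores), statistics.variance(scores)


# Toy data with made-up labels, for illustration only.
POOL = [("CCO", "Yes"), ("c1ccccc1", "No"), ("CC(=O)O", "Yes")]
TEST = [("CCN", "Yes"), ("CCCC", "No")]


def always_yes(prompt: str) -> str:
    """Stand-in for an LLM call; it ignores the prompt."""
    return "Yes"


if __name__ == "__main__":
    print(build_prompt(POOL[:2], "CCN"))
    print(evaluate(always_yes, POOL, TEST, k=2))
```

Setting k=0 yields a zero-shot prompt with the same role, task, and question parts, so the two settings differ only in the demonstrations.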

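EXPERIMENTAL RESULTS

Research Questions

RQ1: How do the different LLMs compare across the eight chemistry tasks under zero-shot and few-shot prompting?
RQ2: How do prompt design, demonstration quality, and ICL retrieval strategy affect LLM performance in chemistry?
RQ3: Which chemistry tasks suit LLMs best, and which still call for task-specific baselines?
RQ4: What are the main limitations and hallucination patterns of LLMs when handling chemical representations such as SMILES?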

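Key Findings

GPT-4 generally outperforms the other evaluated models across tasks.
GPT models perform poorly on tasks that depend heavily on SMILES, such as name prediction, reaction prediction, and retrosynthesis.
On property prediction and yield prediction, LLMs are competitive with the baselines and selectively outperform some of them.
Text-related tasks (molecule design and molecule captioning) show strong qualitative and quantitative results, although exact-match results are limited.
SELFIES representations perform worse than SMILES with these LLMs, possibly because the training data is biased toward SMILES.
In-context learning improves performance over zero-shot prompting, and scaffold-based retrieval tends to outperform random sampling.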
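The idea behind similarity-based demonstration retrieval, as opposed to random sampling, can be illustrated with a small sketch. Note the substitution: real scaffold-based retrieval would extract molecular scaffolds with a cheminformatics toolkit, whereas this standard-library example uses Tanimoto similarity over character bigrams of the raw SMILES string as a crude stand-in. The function names, the toy pool, and its placeholder labels are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

Example = Tuple[str, str]  # (SMILES string, placeholder label)


def bigrams(smiles: str) -> Set[str]:
    """Set of adjacent character pairs; a crude proxy for structure."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_k_similar(query: str, pool: List[Example], k: int) -> List[Example]:
    """Return the k pool entries most similar to the query SMILES."""
    q = bigrams(query)
    ranked = sorted(
        pool,
        key=lambda ex: tanimoto(q, bigrams(ex[0])),
        reverse=True,  # highest similarity first; ties keep pool order
    )
    return ranked[:k]


# Toy pool with placeholder labels, for illustration only.
POOL = [
    ("CCO", "a"),
    ("c1ccccc1", "b"),
    ("c1ccccc1O", "c"),
    ("CCN", "d"),
]

if __name__ == "__main__":
    # Toluene-like query: the aromatic entries rank above the aliphatic ones.
    print(top_k_similar("c1ccccc1C", POOL, k=2))
```

The selected entries would then serve as the few-shot demonstrations in place of a random draw from the pool.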

This review was generated by AI and reviewed by a human editor.