QUICK REVIEW

[Paper Review] Scientific Large Language Models: A Survey on Biological & Chemical Domains

Qiang Zhang, Keyang Ding|arXiv (Cornell University)|Jan 26, 2024

Machine Learning in Materials Science | Cited by 31

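One-Sentence Summary

This survey systematically reviews scientific large language models (Sci-LLMs) for the biological and chemical domains, covering textual, molecular, protein, genomic, and multimodal LLMs along with their architectures, data, evaluation, and open challenges.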

ABSTRACT

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

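Research Motivation and Objectives

Define and formalize the concepts of scientific language and Sci-LLMs in the biological and chemical domains.
Survey the architectures, data, and evaluation of existing Text-Sci-LLMs, Mol-LLMs, Prot-LLMs, Genomic-LLMs, and MM-Sci-LLMs.
Summarize the datasets, benchmarks, and evaluation criteria used for scientific language modeling.
Identify key challenges and propose future research directions for Sci-LLMs.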

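Proposed Method

Categorize Sci-LLMs into three architecture families: encoder-only, decoder-only, and encoder-decoder.
Review the pre-training and fine-tuning datasets drawn from text and domain-specific corpora.
Catalog model capabilities and downstream tasks across textual, molecular, protein, genomic, and multimodal settings.
Examine how scientific languages (molecules, proteins, genomes) are represented and processed by LLMs.
Synthesize limitations and propose directions for advancing multimodal Sci-LLMs.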
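
The three architecture families named above differ mainly in which positions each token is allowed to attend to. The sketch below illustrates that difference with plain attention masks. It is a minimal illustration, not code from the survey: the function names `self_attention_mask` and `cross_attention_mask` are hypothetical, and real models apply such masks inside their attention layers.

```python
def self_attention_mask(seq_len, causal):
    """Build a seq_len x seq_len mask; 1 means position i may attend to position j.

    causal=False: bidirectional attention, as in encoder-only models.
    causal=True:  each position sees only itself and earlier positions,
                  as in decoder-only models.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


def cross_attention_mask(target_len, source_len):
    """In an encoder-decoder model, every decoder position may attend to
    every encoder output position (hypothetical helper for illustration)."""
    return [[1] * source_len for _ in range(target_len)]


# Encoder-only: full bidirectional context.
encoder_mask = self_attention_mask(4, causal=False)

# Decoder-only: lower-triangular (causal) context.
decoder_mask = self_attention_mask(4, causal=True)

# Encoder-decoder: bidirectional encoder, causal decoder, plus cross-attention.
cross_mask = cross_attention_mask(target_len=3, source_len=4)

for row in decoder_mask:
    print(row)
```

Bidirectional masks tend to suit representation and property-prediction tasks, while causal masks suit generation; the encoder-decoder layout combines both for sequence-to-sequence tasks.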

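Experimental Results

Research Questions

RQ1: Which architectures and training paradigms are most effective for Sci-LLMs in the biological and chemical domains?
RQ2: Which datasets and benchmarks drive progress in textual and domain-specific Sci-LLMs?
RQ3: How do Mol-LLMs, Prot-LLMs, Genomic-LLMs, and MM-Sci-LLMs differ in capabilities and evaluation?
RQ4: What are the main challenges and future directions for multimodal scientific language modeling?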

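Key Findings

The survey provides a structured taxonomy of Sci-LLMs spanning the textual, molecular, protein, genomic, and multimodal domains.
It consolidates model families, datasets, and evaluation benchmarks, and clarifies how Sci-LLMs are pre-trained and fine-tuned.
It highlights how scientific languages (molecules, proteins, genomes) differ from natural language in representation and grammar.
It identifies key challenges in data availability, cross-modal alignment, and the evaluation of scientific tasks.
It discusses promising directions, including multimodal integration and domain-specific evaluation criteria.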
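
To make the representational difference between scientific languages and natural language concrete, the sketch below tokenizes one example from each modality. It rests on assumptions that do not come from the survey summary: the function names are hypothetical, the SMILES regular expression is a deliberately simplified pattern rather than a complete SMILES grammar, and overlapping k-mers are only one of several tokenization schemes used for genomic sequences.

```python
import re

# Simplified SMILES pattern (an assumption, not a full grammar): bracketed
# atoms first, then two-letter halogens, two-digit ring closures, and
# finally any single character (atoms, bonds, branches, ring digits).
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_PATTERN.findall(smiles)


def tokenize_protein(sequence):
    """Treat each amino-acid residue (one letter) as a token."""
    return list(sequence)


def tokenize_dna(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(tokenize_smiles("CC(=O)Cl"))   # acetyl chloride
print(tokenize_protein("MKT"))       # three residues
print(tokenize_dna("ATGCA", k=3))    # overlapping 3-mers
```

Unlike natural-language text, these sequences have no whitespace or word boundaries, so the token vocabulary has to be derived from domain structure such as atoms and bonds, residues, or nucleotide k-mers.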

This review was generated by AI and checked by a human editor.
