QUICK REVIEW

[Paper Review] SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Hengxing Cai, Xiaochen Cai | arXiv (Cornell University) | Mar 4, 2024

Library Science and Information Systems | Cited by 13

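One-Sentence Summary

SciAssess is a benchmark that evaluates how well large language models memorize, comprehend, and analyze scientific literature. It spans multiple scientific fields, applies rigorous quality control, and accounts for multimodal content.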

ABSTRACT

Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, material, and medicine. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://github.com/sci-assess/SciAssess.

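Motivation and Objectives

Define a benchmark that evaluates LLMs' memorization, comprehension, and analysis abilities in scientific literature analysis.
Cover a broad range of scientific fields and tasks to reflect the challenges of real-world literature.
Apply rigorous quality control, anonymization, and copyright-compliance measures to ensure reliability.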

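Proposed Method

A three-level ability framework aligned with Bloom's taxonomy: Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3).
Diverse task types, including True/False, Multiple-Choice, Table Extraction, Constrained Generation, and Open-ended Generation.
Domains covered include general chemistry, alloy materials, organic materials, drug discovery, and biology, for broad representation.
Data are drawn from publicly accessible publications and databases to reflect current scientific research.
Experts cross-validate the items for correctness, and the data are screened to anonymize sensitive information, protecting privacy and ensuring copyright compliance.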
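
To make the level-based framework concrete, the following is a minimal sketch of how benchmark items could be tagged with an ability level and how accuracy could be reported per level. It is not the SciAssess implementation: the `Item` layout, the `is_correct` exact-match rule, the `accuracy_by_level` helper, and the sample items are all assumptions made for illustration, and the paper's actual schema and metrics may differ (exact match would not suit the table extraction or generation tasks, for instance).

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    # The three ability levels named in the framework.
    MEMORIZATION = "L1"
    COMPREHENSION = "L2"
    ANALYSIS_REASONING = "L3"


@dataclass(frozen=True)
class Item:
    # Hypothetical record layout; the real SciAssess schema may differ.
    question: str
    answer: str
    level: Level
    task_type: str  # e.g. "true_false" or "multiple_choice"
    domain: str     # e.g. "biology"


def is_correct(prediction: str, item: Item) -> bool:
    # Assumed scoring rule: case-insensitive exact match after trimming.
    return prediction.strip().lower() == item.answer.strip().lower()


def accuracy_by_level(items, predictions):
    """Return {level code: fraction correct} for the levels present."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, prediction in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += is_correct(prediction, item)
    return {level.value: correct[level] / total[level] for level in total}


# Made-up items for illustration only; none are taken from the benchmark.
items = [
    Item("Water has the chemical formula H2O.", "True",
         Level.MEMORIZATION, "true_false", "chemistry"),
    Item("Which base pairs with adenine in DNA? (A) thymine (B) guanine", "A",
         Level.MEMORIZATION, "multiple_choice", "biology"),
    Item("Placeholder comprehension question", "B",
         Level.COMPREHENSION, "multiple_choice", "biology"),
    Item("Placeholder reasoning question", "False",
         Level.ANALYSIS_REASONING, "true_false", "chemistry"),
]
predictions = [" true ", "B", "b", "True"]

scores = accuracy_by_level(items, predictions)
print(scores)
```

Grouping by level rather than reporting one overall score is what lets a benchmark of this shape show whether a model fails at recall, at comprehension, or at reasoning. The same grouping could be keyed on `domain` or `task_type` instead.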

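Experimental Results

Research Questions

RQ1: What are the capabilities and limitations of current LLMs in memorizing, comprehending, and reasoning over scientific text?
RQ2: How do LLMs perform across different scientific domains and multimodal data sources?
RQ3: Which ability level (memorization, comprehension, or analysis) is most challenging for state-of-the-art LLMs in literature analysis?
RQ4: How can a benchmark such as SciAssess inform the development and deployment of LLMs for scientific literature analysis?
RQ5: What quality-control practices are needed to ensure reliable and legally compliant benchmarking in scientific domains?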

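Key Findings

SciAssess evaluates LLMs (e.g., GPT-4, GPT-3.5, Gemini) and identifies their strengths and areas for improvement.
The benchmark combines three progressive ability levels (L1–L3) to diagnose specific capabilities in literature analysis.
It uses a broad set of domains and five task types to capture the varied challenges of scientific text.
The data are rigorously screened, drawn from public publications and databases, and validated and anonymized by experts to ensure privacy and copyright compliance.
Quality-control procedures (correctness checks, anonymization, copyright protection) are emphasized to ensure reliability and legal and ethical integrity.
The framework is intended to drive improvements in LLMs' ability to analyze, synthesize, and reason over scientific literature.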

This review was generated by AI and reviewed by a human editor.