QUICK REVIEW

[Paper Review] PLLaMa: An Open-source Large Language Model for Plant Science

Xianjun Yang, Junfeng Gao | arXiv (Cornell University) | Jan 3, 2024

Biomedical Text Mining and Ontologies | Cited by 10

One-Sentence Summary

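PLLaMa builds on LLaMa-2 by continuing pretraining on a plant science corpus (more than 1.5 million scholarly articles) and then applying instruction tuning to improve plant science question answering and dialogue; the checkpoints are publicly released for community use.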

ABSTRACT

Large Language Models (LLMs) have exhibited remarkable capabilities in understanding and interacting with natural language across various sectors. However, their effectiveness is limited in specialized areas requiring high accuracy, such as plant science, due to a lack of specific expertise in these fields. This paper introduces PLLaMa, an open-source language model that evolved from LLaMa-2. It's enhanced with a comprehensive database, comprising more than 1.5 million scholarly articles in plant science. This development significantly enriches PLLaMa with extensive knowledge and proficiency in plant and agricultural sciences. Our initial tests, involving specific datasets related to plants and agriculture, show that PLLaMa substantially improves its understanding of plant science-related topics. Moreover, we have formed an international panel of professionals, including plant scientists, agricultural engineers, and plant breeders. This team plays a crucial role in verifying the accuracy of PLLaMa's responses to various academic inquiries, ensuring its effective and reliable application in the field. To support further research and development, we have made the model's checkpoints and source codes accessible to the scientific community. These resources are available for download at https://github.com/Xianjun-Yang/PLLaMa.

Research Motivation and Goals

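Improve LLM accuracy in the specialized domain of plant science, beyond what general-domain models achieve.
Develop an open-source, plant-science-oriented LLM through extended pretraining on plant literature.
Improve dialogue ability through instruction tuning to support academic inquiries in plant science.
Release the training checkpoints and source code publicly to enable reproducibility and further research.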

Proposed Method

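Extend the pretraining of LLaMa-2-7B and LLaMa-2-13B with more than 1.5 million plant science articles.
Build the plant science corpus by filtering S2ORC by journal name (750 plant science journals).
Mix the plant science corpus with 10% general data from RedPajama-Data-1T-Sample to mitigate catastrophic forgetting.
Apply bf16, FlashAttention, DeepSpeed ZeRO stage 3, and Fully Sharded Data Parallel (FSDP) for efficient training.
Perform instruction tuning using 1,030 instructions from the LIMA set together with plant-science-specific prompts; train with bf16 and FSDP.
Evaluate on an unpublished plant science quiz and on zero-shot cases; report accuracy and qualitative assessments.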
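
The corpus-construction steps above (filter by journal name, then blend in a small share of general text) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout (dicts with `journal` and `text` keys), the function names, and the reading of "10%" as "general documents equal to 10% of the plant science document count" are all assumptions made for the example.

```python
import random


def filter_by_journal(records, journal_names):
    """Keep records whose 'journal' field matches a target journal (case-insensitive).

    The 'journal' key is a hypothetical field name, not the real S2ORC schema.
    """
    wanted = {name.strip().lower() for name in journal_names}
    return [r for r in records if (r.get("journal") or "").strip().lower() in wanted]


def mix_with_general(domain_docs, general_docs, general_ratio=0.10, seed=0):
    """Add general-domain documents and shuffle the result.

    Assumption: the number of general documents added equals
    general_ratio times the number of domain documents.
    """
    rng = random.Random(seed)
    n_general = min(len(general_docs), round(len(domain_docs) * general_ratio))
    mixed = list(domain_docs) + rng.sample(general_docs, n_general)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    records = [
        {"journal": "Plant Physiology", "text": "stomatal conductance ..."},
        {"journal": "Nature Physics", "text": "quantum ..."},
    ]
    plant_docs = filter_by_journal(records, ["Plant Physiology"])
    general_docs = [{"text": f"general sample {i}"} for i in range(100)]
    print(len(mix_with_general(plant_docs * 20, general_docs)))
```

Sampling with a fixed seed keeps the mixture reproducible; a token-level rather than document-level ratio would be an equally plausible reading of the 10% figure.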

Experimental Results

Research Questions

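RQ1: Can domain-specific pretraining significantly improve the performance of an open-source LLM on plant science tasks?
RQ2: Does instruction tuning further strengthen plant science dialogue and question answering on top of the extended pretraining?
RQ3: How does PLLaMa actually perform on a plant science quiz and on zero-shot queries?
RQ4: Are the released checkpoints and code sufficient for reproducibility and further domain-specific work?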

Key Findings

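PLLaMa-13B-Chat achieved roughly 60% accuracy on an unpublished 10-question plant science quiz.
Pretraining ran on eight A100 GPUs and instruction tuning on four A100 GPUs, with detailed resource usage and timelines reported: pretraining took about 26 hours for 7B and about 57 hours for 13B, and instruction tuning took about 1.3 hours for 7B and about 2.7 hours for 13B.
The model produces domain-relevant answers, and an international panel of plant scientists and engineers judged them useful.
The model checkpoints and source code have been released to the community for download and reproduction.
By adding a large plant science corpus and domain-focused instruction tuning on top of LLaMa-2, PLLaMa narrows the gap with domain experts.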
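
To put the reported training times in a single unit, the short sketch below converts them to GPU-hours. It assumes the quoted hours are wall-clock durations on the stated number of A100 GPUs, which the summary does not say explicitly; all figures are approximate, and the names `RUNS`, `gpu_hours`, and `summarize` are illustrative only.

```python
# Approximate figures as reported in the findings above.
RUNS = [
    # (model, stage, num_gpus, approx_wall_clock_hours)
    ("7B", "pretraining", 8, 26.0),
    ("13B", "pretraining", 8, 57.0),
    ("7B", "instruction tuning", 4, 1.3),
    ("13B", "instruction tuning", 4, 2.7),
]


def gpu_hours(num_gpus, hours):
    """GPU-hours = number of GPUs times wall-clock hours (assumed interpretation)."""
    return num_gpus * hours


def summarize(runs):
    """Sum GPU-hours per model size across both training stages."""
    totals = {}
    for model, _stage, num_gpus, hours in runs:
        totals[model] = totals.get(model, 0.0) + gpu_hours(num_gpus, hours)
    return totals


if __name__ == "__main__":
    for model, stage, num_gpus, hours in RUNS:
        print(f"{model} {stage}: {gpu_hours(num_gpus, hours):.1f} GPU-hours")
    for model, total in summarize(RUNS).items():
        print(f"{model} total: {total:.1f} GPU-hours")
```

Under this reading, instruction tuning accounts for only a small fraction of the total compute for either model size; nearly all of the cost is in the extended pretraining.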


This review was generated by AI and checked by a human editor.