QUICK REVIEW

[Paper Review] LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction

Andre Niyongabo Rubungo, Kangming Li|arXiv (Cornell University)|Oct 31, 2024

Machine Learning in Materials Science | Cited by 6

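One-Sentence Summary

In brief: LLM4Mat-Bench is a large-scale benchmark that evaluates how well various LLMs predict the properties of crystalline materials from composition, CIF, or text description inputs, and it shows that task-specific models outperform general-purpose LLMs on materials property prediction.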

ABSTRACT

Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.

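Research Motivation and Goals

Address the need for a standardized benchmark for evaluating LLMs on materials property prediction.
Create a comprehensive and diverse benchmark (LLM4Mat-Bench) that spans multiple data sources, input modalities, and properties.
Evaluate a range of models, from task-specific predictive models to general-purpose LLMs, to identify their strengths and limitations.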

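Proposed Method

Collect about 1.9M crystal structures from 10 data sources and consolidate them into 1,978,985 deduplicated composition-structure-description pairs.
Generate crystal structure descriptions deterministically with Robocrystallographer, yielding a text input modality that is free of data contamination.
Evaluate three material representations (Composition, CIF, Description) across several model families: LLM-Prop, MatBERT, Llama, Gemma, and Mistral, with CGCNN as a baseline.
Fine-tune the small, task-specific models (LLM-Prop, MatBERT) and compare them with zero-shot and few-shot prompting of larger, chat-style LLMs.
Use fixed train/validation/test splits and standard evaluation metrics (the MAD:MAE ratio for regression, AUC for classification) so that comparisons are reproducible.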
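
The MAD:MAE ratio used for regression can be illustrated with a short sketch. It assumes the usual definitions: MAD is the mean absolute deviation of the ground-truth values around their own mean, and MAE is the mean absolute error of the predictions. The function name `mad_mae_ratio` is hypothetical and is not taken from the benchmark's code.

```python
from statistics import fmean


def mad_mae_ratio(y_true, y_pred):
    """Return MAD / MAE for one regression task (higher is better).

    Assumed definitions (not copied from the benchmark's code):
      MAD = mean(|y_i - mean(y)|) over the ground-truth values
      MAE = mean(|y_i - yhat_i|) over the predictions
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    mean_true = fmean(y_true)
    mad = fmean(abs(y - mean_true) for y in y_true)
    mae = fmean(abs(y - p) for y, p in zip(y_true, y_pred))

    # A perfect predictor has MAE == 0; report an infinite ratio in that case.
    if mae == 0:
        return float("inf")
    return mad / mae
```

Under these definitions, a model that always predicts the mean of the ground-truth values scores exactly 1, so the ratio reads as how many times smaller the model's error is than that trivial baseline.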

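Experimental Results

Research Questions

RQ1: Can LLMs be used effectively for materials property prediction across multiple data sources and input modalities?
RQ2: Do smaller, task-specific LLMs outperform general-purpose chat LLMs on materials property prediction?
RQ3: Which input representation (Composition, CIF, Description) gives LLM-based models the best predictive performance?
RQ4: How do zero-shot and few-shot prompting of chat LLMs compare with fine-tuned predictive models in this domain?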
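
To make the zero-shot versus few-shot distinction in RQ4 concrete, here is a minimal sketch of how such prompts could be assembled. The template wording, the function name `build_prompt`, and its parameters are assumptions for illustration only; they are not the prompts shipped with LLM4Mat-Bench.

```python
def build_prompt(material_input, property_name, examples=()):
    """Assemble a prompt for a chat LLM.

    With no examples this is a zero-shot prompt; with one or more
    (input, value) example pairs it becomes a few-shot prompt.
    The wording below is a hypothetical template, not the benchmark's.
    """
    lines = [
        f"Predict the {property_name} of the material given below. "
        "Answer with a number only."
    ]
    # Few-shot demonstrations: solved examples placed before the query.
    for example_input, example_value in examples:
        lines.append(f"Material: {example_input}")
        lines.append(f"{property_name}: {example_value}")
    # The query itself, left open for the model to complete.
    lines.append(f"Material: {material_input}")
    lines.append(f"{property_name}:")
    return "\n".join(lines)
```

The `material_input` argument can carry any of the three modalities (composition string, CIF text, or text description); only the number of demonstrations changes between the zero-shot and few-shot settings.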

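Key Findings

Smaller, task-specific predictive LLMs (LLM-Prop and MatBERT) outperform general-purpose chat LLMs on both regression and classification tasks.
Description-based inputs generally give LLM-based property prediction models better performance than CIF or composition inputs.
More advanced and larger generative LLMs bring only limited improvement, and they often produce invalid outputs or hallucinate material properties.
Energetic properties are predicted more accurately than other properties across datasets.
Fine-tuning on MP data can be effective, but the gains vary across datasets and properties; general-purpose LLMs need targeted fine-tuning to perform well in this domain.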
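
The finding about invalid outputs implies that free-text responses from generative LLMs have to be checked before any regression metric can be computed. The sketch below shows one simple way to do that, assuming a response counts as valid when a number can be extracted from it. The names `parse_prediction` and `invalid_rate` and the regular expression are hypothetical, not the benchmark's actual parsing rules.

```python
import re

# Matches an optionally signed decimal number, with optional exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_prediction(response):
    """Return the first number found in a model response, or None.

    Deliberately naive: digits that appear inside other tokens
    (for example a space-group symbol) would also be picked up.
    """
    match = _NUMBER.search(response)
    return float(match.group()) if match else None


def invalid_rate(responses):
    """Fraction of responses from which no number could be extracted."""
    if not responses:
        raise ValueError("responses must be non-empty")
    invalid = sum(parse_prediction(r) is None for r in responses)
    return invalid / len(responses)
```

Reporting the invalid rate alongside the error metric keeps the comparison fair, since a model scored only on its parseable answers would otherwise look better than it is.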

This review was generated by AI and checked by a human editor.