QUICK REVIEW

[Paper Review] Crystal Structure Generation with Autoregressive Large Language Modeling

Luis M. Antunes, Keith T. Butler|arXiv (Cornell University)|Jul 10, 2023

Machine Learning in Materials Science | Citations: 18

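One-line summary

CrystaLLM trains an autoregressive Transformer on CIF text to generate plausible inorganic crystal structures, and improves the quality and realism of those structures through MCTS-guided refinement with an energy predictor. It generalizes to unseen chemical formulas and compares favorably with diffusion- and VAE-based baselines on CSP benchmarks.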

ABSTRACT

The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. Quickly generating and predicting inorganic crystal structures is important for the discovery of new materials, which can target applications such as energy or electronic devices. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. The integration with predictors of formation energy permits the use of a Monte Carlo Tree Search algorithm to improve the generation of meaningful structures. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective 'world models' of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.

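Motivation and objectives

Enable fast, flexible generation of plausible inorganic crystal structures to accelerate crystal structure prediction (CSP) workflows.
Develop a model that treats crystal structures as token sequences in CIF format and learns a world model of crystal chemistry.
Demonstrate the generation of unseen structures and evaluate generalization across structure classes and space groups.
Integrate with Monte Carlo Tree Search guided by an energy predictor to improve the generation of physically meaningful candidates.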

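Proposed method

Train decoder-only Transformers on millions of CIF files (a small model with 25M parameters and a large model with 200M parameters).
Represent each CIF as a token sequence and predict the next token autoregressively, generating new CIF files conditioned on a prompt (cell composition and, optionally, space group).
Evaluate the syntactic validity and chemical plausibility of generated CIFs against a held-out test set of structures.
Assess generalization by prompting with unseen chemical formulas and space groups, using a challenge set that includes structures from the literature that are absent from the training data.
Combine Monte Carlo Tree Search (MCTS) with an ALIGNN predictor of formation energy per atom to steer sampling toward low-energy structures.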
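The energy-guided search in the last step can be illustrated with a toy example. The sketch below is a plain UCT-style MCTS over short token sequences and is not the paper's implementation: `VOCAB`, `SEQ_LEN`, `toy_policy` (a stand-in for the language model) and `toy_energy` (a stand-in for an energy predictor such as ALIGNN) are all hypothetical names introduced for illustration. The reward is simply the negated energy, so lower-energy sequences are visited more often.

```python
import math
import random

VOCAB = ["A", "B", "C"]  # hypothetical stand-ins for CIF tokens
SEQ_LEN = 4              # hypothetical length of a complete "structure"

def toy_policy(prefix):
    """Stand-in for the language model: uniform next-token probabilities."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def toy_energy(seq):
    """Stand-in for an energy predictor: each 'A' lowers the energy by 1."""
    return -float(list(seq).count("A"))

class Node:
    def __init__(self, prefix, parent=None):
        self.prefix = prefix
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.total_reward = 0.0

    def is_terminal(self):
        return len(self.prefix) == SEQ_LEN

def select_child(node, c=1.4):
    """UCT rule: mean reward plus an exploration bonus."""
    def uct(child):
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=uct)

def rollout(prefix, rng):
    """Complete a partial sequence by sampling from the policy."""
    seq = list(prefix)
    while len(seq) < SEQ_LEN:
        probs = toy_policy(seq)
        toks = list(probs)
        seq.append(rng.choices(toks, weights=[probs[t] for t in toks])[0])
    return seq

def mcts(n_iter=200, seed=0):
    rng = random.Random(seed)
    root = Node(prefix=())
    best_seq, best_energy = None, float("inf")
    for _ in range(n_iter):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while not node.is_terminal() and len(node.children) == len(VOCAB):
            node = select_child(node)
        # Expansion: add the first untried token as a new child.
        if not node.is_terminal():
            tok = next(t for t in VOCAB if t not in node.children)
            node.children[tok] = Node(node.prefix + (tok,), parent=node)
            node = node.children[tok]
        # Simulation: finish the sequence and score it.
        seq = rollout(node.prefix, rng)
        energy = toy_energy(seq)
        if energy < best_energy:
            best_seq, best_energy = seq, energy
        # Backpropagation: reward is the negated energy.
        while node is not None:
            node.visits += 1
            node.total_reward -= energy
            node = node.parent
    return "".join(best_seq), best_energy

print(mcts())
```

In a real setting the rollout would sample CIF tokens from the trained model and the score would come from the energy predictor applied to the parsed structure; both are replaced here by trivial functions so the search loop itself is easy to follow.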

Figure 1: (a) Core concepts in training a Large Language Model of CIF files: A CIF file (left) is converted into a sequence of symbols through tokenization. The sequence is processed by the model, which produces a list of probability distributions over the vocabulary, for each corresponding symbol i
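As a rough illustration of the pipeline in the caption, the sketch below tokenizes a tiny CIF fragment and produces one probability distribution over the vocabulary per position. Everything here is an assumption made for illustration: the CIF fragment, the `KNOWN_TOKENS` list, and `toy_logits`, which merely stands in for the Transformer and favours repeating the last token. It is not CrystaLLM's actual tokenizer or model.

```python
import math
import re

# Hypothetical miniature CIF fragment for illustration; not from the paper's data.
CIF_TEXT = "data_Na1Cl1\n_cell_length_a 5.64\n"

# Assumed multi-character tokens; everything else falls back to single characters.
KNOWN_TOKENS = ["data_", "_cell_length_a", "Na", "Cl"]
_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(KNOWN_TOKENS, key=len, reverse=True)) + "|.",
    re.DOTALL,
)

def tokenize(text):
    """Split text into known tokens (longest first), else single characters."""
    return _PATTERN.findall(text)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(prefix_ids, vocab_size):
    """Hypothetical stand-in for the model: favours repeating the last token."""
    return [2.0 if i == prefix_ids[-1] else 0.0 for i in range(vocab_size)]

tokens = tokenize(CIF_TEXT)
vocab = sorted(set(tokens))
ids = [vocab.index(t) for t in tokens]

# One distribution over the vocabulary for each position in the sequence.
dists = [softmax(toy_logits(ids[: i + 1], len(vocab))) for i in range(len(ids))]

print(tokens)
print(len(dists), len(vocab))
```

With a trained model in place of `toy_logits`, sampling from the last distribution and appending the sampled token, repeatedly, would give the autoregressive generation loop.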

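Experimental results

Research questions

RQ1: Can CrystaLLM generate syntactically valid CIF files for unseen inorganic structures?
RQ2: How well does CrystaLLM generalize to unseen compositions and space groups?
RQ3: Does conditioning on the space group improve generation quality and agreement with known structures?
RQ4: How does CrystaLLM compare with diffusion- and VAE-based CSP models on benchmark datasets?
RQ5: Can a search strategy (MCTS with energy prediction) improve the quality of generated structures?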

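Key findings

On the held-out test set, the small model generated valid CIFs 93.8% of the time without a space group and 94.0% with one. The longest valid CIF was 1145 tokens without a space group and 970 tokens with one.
With space group conditioning, structural consistency metrics for CIFs generated on the test set were high (e.g., 99.1% space group consistency and 99.4% atom site multiplicity consistency).
On the 70-structure challenge set (58 unseen structures from the literature, 12 seen in training), the small model reached 85.7% without a space group and 88.6% with one; the large model reached 87.1% and 91.4%, respectively. The match rate on unseen structures peaked at 41.4%, for the large model with a space group.
For unseen compounds, the large model's generations match the true challenge-set structure in about 40% of cases (higher still when the space group is provided).
CrystaLLM outperforms CDVAE and DiffCSP in RMSE on several CSP benchmarks, is particularly effective when 20 samples are drawn per test composition, and offers the distinctive ability to condition on the space group.
The approach supports structure generation by analogy (e.g., generating structures by substitution into a ZrMn6Sn6-like motif) and produces plausible structures for complex classes such as rutile, spinel, elpasolite, and pyrochlore.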
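One of the consistency metrics reported above, atom site multiplicity consistency, can be illustrated with a short check. The sketch below assumes the metric means that the site multiplicities (weighted by occupancy) add up to the declared cell composition; that reading, and the helper names `parse_cell_composition` and `multiplicity_consistent`, are assumptions for illustration rather than the paper's code.

```python
import re
from collections import Counter

def parse_cell_composition(formula):
    """Parse a cell composition such as 'Na4 Cl4' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d+)", formula):
        counts[element] += int(number)
    return counts

def multiplicity_consistent(cell_formula, sites, tol=1e-6):
    """Check that site multiplicities times occupancies match the cell composition.

    sites is a list of (element, multiplicity, occupancy) tuples.
    """
    totals = Counter()
    for element, multiplicity, occupancy in sites:
        totals[element] += multiplicity * occupancy
    expected = parse_cell_composition(cell_formula)
    if set(totals) != set(expected):
        return False
    return all(abs(totals[el] - expected[el]) <= tol for el in expected)

# Rock-salt NaCl: one Na site and one Cl site, each with multiplicity 4.
rock_salt_sites = [("Na", 4, 1.0), ("Cl", 4, 1.0)]
print(multiplicity_consistent("Na4 Cl4", rock_salt_sites))                      # True
print(multiplicity_consistent("Na4 Cl4", [("Na", 4, 1.0), ("Cl", 8, 1.0)]))    # False
```

A space group consistency check would additionally require a symmetry analysis of the generated structure, which is outside the scope of this sketch.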

This review was generated by AI and checked by a human editor.