QUICK REVIEW

[Paper Review] Fast machine learning models of electronic and energetic properties consistently reach approximation errors better than DFT accuracy

Felix A. Faber, Luke D. Hutchinson|arXiv (Cornell University)|Feb 17, 2017

Machine Learning in Materials Science | Cited by 11

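One-Sentence Summary

This study combines a range of molecular representations and regressors to build fast machine learning models for thirteen electronic and energetic properties of organic molecules. The models' out-of-sample errors relative to hybrid density functional theory (DFT) are on par with, or close to, chemical accuracy and are smaller than hybrid DFT's own deviation from experiment, which suggests that models trained on experimental or explicitly electron-correlated quantum data could be more accurate than hybrid DFT.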

ABSTRACT

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ~117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al., Scientific Data 1, 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data were available.

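Motivation and Objectives

Assess how the choice of molecular representation and regressor affects the accuracy of ML models of electronic and energetic properties.
Assess whether ML models can reach prediction errors smaller than hybrid DFT's deviation from experiment.
Determine whether ML models can reach or surpass chemical accuracy when predicting ground-state molecular properties.
Identify the combinations of representation and regressor that give the most accurate fast predictions.
Explore the potential of ML models to outperform DFT when trained on higher-accuracy reference data, such as experimental or explicitly electron-correlated quantum data.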

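Proposed Method

Train and test models on the QM9 database, using up to ~117k organic molecules whose properties were computed with hybrid DFT.
Evaluate several molecular representations: Coulomb matrix, bag of bonds, BAML, ECFP4, molecular graphs (MG), and newly developed distribution-based variants (HD, HDA/MARAD, HDAD).
Apply several regressors: Bayesian ridge regression (BR), linear regression with elastic net regularization (EN), random forest (RF), kernel ridge regression (KRR), graph convolutions (GC), and gated graph networks (GG).
Use learning curves, which report out-of-sample error as a function of training set size, to compare models systematically.
Compare model predictions directly against the hybrid DFT reference values, and set those errors against DFT's deviation from experimental benchmarks.
Quantify accuracy with the root-mean-square error (RMSE) and chemical accuracy thresholds (1 kcal/mol for thermochemical properties, 0.01 eV for electronic properties).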
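To make the representation step more concrete, the sketch below builds two of the simpler descriptors named above for a single toy molecule. The Coulomb matrix follows its commonly cited definition (0.5 Z_i^2.4 on the diagonal, Z_i Z_j / |R_i - R_j| off the diagonal). The distance histogram is a deliberately simplified, element-agnostic stand-in for HD, not the paper's exact featurization. The function names, bin settings, and water geometry are illustrative assumptions, and distances are used in whatever unit the coordinates are given in.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| elsewhere."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm


def distance_histogram(coords, bin_width=0.5, n_bins=8):
    """Count interatomic distances per bin (simplified: ignores element types)."""
    hist = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            b = int(math.dist(coords[i], coords[j]) / bin_width)
            if b < n_bins:  # distances beyond the last bin are dropped
                hist[b] += 1
    return hist


# Approximate water geometry for illustration only (not taken from QM9).
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

cm = coulomb_matrix(charges, coords)
hd = distance_histogram(coords)
```

Either output could then be flattened into a fixed-length feature vector and passed to a regressor such as KRR. Real pipelines also have to handle molecules of different sizes (for example by zero-padding and sorting the Coulomb matrix), which this sketch leaves out.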

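Experimental Results

Research Questions

RQ1: Can ML models trained on DFT data achieve prediction errors lower than hybrid DFT's error relative to experiment?
RQ2: Which combinations of molecular representation and regressor give the most accurate predictions across the electronic and energetic properties?
RQ3: How much do distribution-based representations (histograms of distances, angles, and dihedrals) improve model performance compared with standard descriptors?
RQ4: Does any ML model reach out-of-sample errors below the chemical accuracy threshold for all 13 properties studied?
RQ5: How much potential do ML models have to outperform DFT when trained on higher-accuracy reference data, such as experimental or explicitly electron-correlated quantum data?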

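Key Findings

For all properties, ML out-of-sample errors relative to hybrid DFT are smaller than the typical deviation of hybrid DFT from experiment.
The best-performing models reach out-of-sample errors on par with, or close to, chemical accuracy (1 kcal/mol for thermochemical properties, 0.01 eV for electronic properties).
Graph-based models (GC and GG) combined with distribution-based representations (such as HD and HDA/MARAD) perform better for most properties.
Even linear models such as Bayesian ridge regression can reach sub-chemical accuracy when paired with a suitable representation (such as molecular graphs or histograms).
The study provides numerical evidence suggesting that ML models could be more accurate than hybrid DFT if trained on experimental or explicitly electron-correlated quantum data.
Learning curves show rapid convergence, with prediction errors stabilizing at relatively small training set sizes, which indicates high data efficiency.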
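As a rough illustration of how learning-curve points can be examined, the sketch below computes an RMSE and fits a power law, error ≈ a · N^(-b), by least squares on log-log axes. The power-law form is an assumption commonly made when reading learning curves, not something stated in this review. The data points are hypothetical and are not results from the paper, and all function names are illustrative.

```python
import math

KCAL_PER_MOL_IN_EV = 0.0433641  # 1 kcal/mol expressed in eV


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)
    )


def fit_power_law(train_sizes, errors):
    """Fit log(error) = log(a) - b * log(N) by least squares; return (a, b)."""
    xs = [math.log(n) for n in train_sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return math.exp(y_mean - slope * x_mean), -slope


def size_for_target(a, b, target_error):
    """Training set size at which the fitted curve would reach target_error."""
    return (a / target_error) ** (1.0 / b)


# Hypothetical learning-curve points (error in kcal/mol); NOT taken from the paper.
sizes = [1000, 10000, 100000]
errors = [4.0, 2.0, 1.0]

a, b = fit_power_law(sizes, errors)
n_needed = size_for_target(a, b, target_error=1.0)  # 1 kcal/mol threshold
```

With these made-up points the error halves for every tenfold increase in training data, so the fitted curve reaches the 1 kcal/mol threshold at the largest training set size. A curve that flattens earlier than the fit predicts would point to a plateau, at which point extrapolating the power law is no longer justified.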


This review was generated by AI and reviewed by a human editor.