QUICK REVIEW

[Paper Review] Overcoming data scarcity with transfer learning

Maxwell L. Hutchinson, Erin Antono|arXiv (Cornell University)|Nov 2, 2017

Machine Learning in Materials Science | References: 7 | Cited by: 79

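One-Sentence Summary

This paper compares three transfer learning architectures (explicit latent variable, multi-task, and difference) on materials informatics problems, with the aim of bridging sparse data while preserving the context of each dataset. The case studies cover band gap and color data and the activation energies of steps in NO reduction.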

ABSTRACT

Despite increasing focus on data publication and discovery in materials science and related fields, the global view of materials data is highly sparse. This sparsity encourages training models on the union of multiple datasets, but simple unions can prove problematic as (ostensibly) equivalent properties may be measured or computed differently depending on the data source. These hidden contextual differences introduce irreducible errors into analyses, fundamentally limiting their accuracy. Transfer learning, where information from one dataset is used to inform a model on another, can be an effective tool for bridging sparse data while preserving the contextual differences in the underlying measurements. Here, we describe and compare three techniques for transfer learning: multi-task, difference, and explicit latent variable architectures. We show that difference architectures are most accurate in the multi-fidelity case of mixed DFT and experimental band gaps, while multi-task most improves classification performance of color with band gaps. For activation energies of steps in NO reduction, the explicit latent variable method is not only the most accurate, but also enjoys cancellation of errors in functions that depend on multiple tasks. These results motivate the publication of high quality materials datasets that encode transferable information, independent of industrial or academic interest in the particular labels, and encourage further development and application of transfer learning methods to materials informatics problems.

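Motivation and Objectives

Motivate the data scarcity problem in materials informatics and the need to preserve contextual differences between datasets.
Evaluate how transfer learning can use large, low-fidelity datasets to improve predictions on small, high-fidelity datasets.
Compare three transfer learning (TL) architectures (explicit latent variable, multi-task, and difference) on real materials problems.
Assess the trade-off between accuracy gains and interpretability for each TL method across the case studies.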

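Proposed Method

Describe and implement three TL architectures: explicit latent variable, multi-task, and difference.
Use random forests as the base learner, with jackknife-based uncertainty quantification.
Compare the TL methods against single-task baselines using cross-validation and held-out sets.
Apply the TL architectures to band gap and color prediction (transfer from DFT to experiment) and to the activation energies of the multiple reaction steps in NO reduction.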
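
To make the difference and explicit latent variable architectures concrete, here is a minimal sketch under stated assumptions. It follows one common reading of the two ideas: the difference variant models the gap between the high-fidelity label and a low-fidelity prediction and adds it back, while the latent variable variant feeds the low-fidelity prediction in as an extra input feature. The paper's exact formulation may differ. All data and names (`knn_fit`, `low_fidelity`, `high_fidelity`, and so on) are hypothetical. A tiny nearest-neighbor regressor stands in for the random forests used in the paper so that the example has no dependencies. The multi-task variant is left out because it needs a base learner with joint outputs.

```python
# Toy sketch of two transfer learning architectures (hypothetical data and names).
# Assumption: a nearest-neighbor regressor replaces the paper's random forests
# purely to keep the example dependency-free.

def knn_fit(X, y, k=1):
    """Return predict(x), which averages the labels of the k nearest training rows."""
    rows = list(zip(X, y))

    def predict(x):
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], x))
        nearest = sorted(rows, key=sq_dist)[:k]
        return sum(label for _, label in nearest) / len(nearest)

    return predict


# Made-up "fidelities": the cheap label misses a slowly varying offset.
def low_fidelity(x):
    return 0.5 * x * x

def high_fidelity(x):
    return low_fidelity(x) + 1.0 + 0.5 * x


# Abundant low-fidelity labels, scarce high-fidelity labels.
X_low = [(i / 10,) for i in range(41)]
y_low = [low_fidelity(x[0]) for x in X_low]
X_high = [X_low[i] for i in (0, 10, 20, 30, 40)]
y_high = [high_fidelity(x[0]) for x in X_high]
X_test = [X_low[i] for i in (3, 13, 23, 33)]
y_test = [high_fidelity(x[0]) for x in X_test]

# Single-task baseline: sees only the scarce high-fidelity labels.
baseline = knn_fit(X_high, y_high)

# Shared first stage: a model of the abundant low-fidelity label.
low_model = knn_fit(X_low, y_low)

# Difference architecture: learn (high - low prediction), then add it back.
residuals = [yh - low_model(x) for x, yh in zip(X_high, y_high)]
diff_model = knn_fit(X_high, residuals)

def predict_difference(x):
    return low_model(x) + diff_model(x)

# Explicit latent variable: append the low-fidelity prediction as a feature.
X_high_aug = [x + (low_model(x),) for x in X_high]
latent_model = knn_fit(X_high_aug, y_high)

def predict_latent(x):
    return latent_model(x + (low_model(x),))

def mae(predict):
    """Mean absolute error on the held-out high-fidelity test points."""
    return sum(abs(predict(x) - y) for x, y in zip(X_test, y_test)) / len(X_test)

for name, model in [("baseline", baseline),
                    ("difference", predict_difference),
                    ("latent", predict_latent)]:
    print(f"{name:10s} MAE = {mae(model):.3f}")
```

The toy data is built so that the offset between fidelities varies slowly, which is the situation where the difference variant should beat the baseline. With this stand-in learner the latent variable variant shows no improvement over the baseline, which is a property of the toy setup and says nothing about the paper's results.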

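Experimental Results

Research Questions

RQ1: Does transfer learning improve prediction accuracy when going from low-fidelity (DFT) to high-fidelity (experimental) band gaps and the related color data?
RQ2: Which TL architecture (explicit latent variable, multi-task, or difference) performs best on the multi-fidelity band gap and color classification tasks?
RQ3: Do TL methods improve activation energy prediction and rate-determining step (RDS) classification for NO reduction catalysis, and which architecture is most effective?
RQ4: How do these TL methods balance accuracy, data efficiency, and interpretability when data are scarce?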

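Key Findings

The difference architecture performs best on multi-fidelity band gaps, ahead of both the baseline and the other TL methods.
The explicit latent variable architecture generally gives robust improvements and can outperform the baseline on several tasks.
Multi-task learning matches or exceeds the baseline on some tasks (for example, color classification with band gaps) but performs poorly on others because of label imbalance.
For the activation energies in NO reduction, the latent variable architecture reduces the error at each step and improves the F1 score for rate-determining step (RDS) classification, whereas multi-task learning can perform poorly on RDS classification.
TL methods reach comparable performance with far fewer experimental labels (for example, roughly 4x fewer labels for band gaps with the difference architecture).
The latent variable and difference architectures offer better interpretability because they expose the relationships between labels.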

This review was generated by AI and reviewed by a human editor.