QUICK REVIEW

[Paper Review] MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

Shuai Peng, Ke Yuan | arXiv (Cornell University) | May 2, 2021

Mathematics, Computing, and Information Processing | References: 25 | Cited by: 53

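One-Sentence Summary

MathBERT is pre-trained jointly on mathematical formulas, their contexts, and operator trees to capture both semantic and structural information, and it reports state-of-the-art results on several math-related tasks.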

ABSTRACT

Large-scale pre-trained models like BERT have obtained great success in various Natural Language Processing (NLP) tasks, while it is still a challenge to adapt them to math-related tasks. Current pre-trained models neglect the structural features of a formula and the semantic correspondence between a formula and its context. To address these issues, we propose a novel pre-trained model, namely MathBERT, which is jointly trained with mathematical formulas and their corresponding contexts. In addition, in order to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict the masked formula substructures extracted from the Operator Tree (OPT), which is the semantic structural representation of formulas. We conduct various experiments on three downstream tasks to evaluate the performance of MathBERT, including mathematical information retrieval, formula topic classification and formula headline generation. Experimental results demonstrate that MathBERT significantly outperforms existing methods on all three tasks. Moreover, we qualitatively show that this pre-trained model effectively captures the semantic-level structural information of formulas. To the best of our knowledge, MathBERT is the first pre-trained model for mathematical formula understanding.

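Motivation and Goals

Advance the understanding of mathematical formulas beyond plain text by exploiting both their context and their structure.
Propose a unified pre-training framework that jointly uses formulas, contexts, and operator trees (OPTs).
Design a novel masked substructure prediction task to encode semantic-level formula structure.
Build a large arXiv-based dataset of formula-context-OPT triples for pre-training.
Show improvements over baselines on downstream math tasks, and provide a qualitative analysis of how well semantic structure is captured.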

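Proposed Method

The input consists of LaTeX tokens (the formula), the formula's textual context, and an operator tree (OPT).
An OPT-based mask guides the Transformer attention so that it encodes semantic-level structure.
Three pre-training tasks: masked language modeling (MLM), context correspondence prediction (CCP), and masked substructure prediction (MSP).
MSP masks operator substructures and predicts the parent and child nodes within the OPT.
Pre-training data: 8.7 million formula-context-OPT triples from arXiv LaTeX sources; initialized from BERT-base; maximum sequence length 256.
Evaluation on three downstream tasks: mathematical information retrieval, formula topic classification, and formula headline generation.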
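To make the OPT-based attention mask and the MSP targets more concrete, here is a minimal Python sketch on a toy operator tree for a + b * c. The masking rule, the node ids, and the function names (`build_attention_mask`, `msp_targets`) are assumptions made for illustration; this summary does not spell out the exact rule used in the paper.

```python
from typing import Dict, List, Tuple

# Toy operator tree (OPT) for the formula a + b * c.
# Node ids are hypothetical; each edge points from a parent to a child.
OPT_NODES: List[str] = ["+", "a", "*", "b", "c"]
OPT_EDGES: List[Tuple[int, int]] = [(0, 1), (0, 2), (2, 3), (2, 4)]


def build_attention_mask(
    n_text: int, n_nodes: int, edges: List[Tuple[int, int]]
) -> List[List[int]]:
    """Return a 0/1 matrix where 1 means "position i may attend to position j".

    Assumed rule (an illustration, not the paper's definition): any pair that
    involves a text position (LaTeX tokens plus context) is visible, every
    position sees itself, and two OPT nodes see each other only if they share
    an edge in the tree.
    """
    size = n_text + n_nodes
    mask = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n_text or j < n_text or i == j:
                mask[i][j] = 1
    for parent, child in edges:
        p, c = n_text + parent, n_text + child
        mask[p][c] = 1
        mask[c][p] = 1
    return mask


def msp_targets(
    node: int, nodes: List[str], edges: List[Tuple[int, int]]
) -> Dict[str, List[str]]:
    """Collect the parent and child labels a model would predict for `node`."""
    parents = [nodes[p] for p, c in edges if c == node]
    children = [nodes[c] for p, c in edges if p == node]
    return {"parent": parents, "children": children}


if __name__ == "__main__":
    # Three text positions followed by the five OPT nodes.
    mask = build_attention_mask(n_text=3, n_nodes=len(OPT_NODES), edges=OPT_EDGES)
    for row in mask:
        print(row)
    # Targets for the "*" node (index 2 in OPT_NODES).
    print(msp_targets(2, OPT_NODES, OPT_EDGES))
```

In a real model the 0 entries would typically become large negative values added to the attention logits, so that sibling or unrelated nodes cannot attend to each other directly.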

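Experimental Results

Research Questions

RQ1: Can a pre-trained model that jointly uses formulas, contexts, and operator trees improve the understanding of mathematical expressions?
RQ2: Does the structure-aware pre-training task (MSP) outperform MLM and CCP on math-related tasks?
RQ3: How does MathBERT perform on MIR, topic classification, and headline generation compared with the baselines?
RQ4: To what extent does OPT-based attention improve the model's ability to capture formula semantics?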

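Main Findings

On NTCIR-12 MathIR, MathBERT outperforms the baselines, achieving the highest partial bpref and weighted harmonic-mean bpref.
On TopicMath-100K, MathBERT outperforms both models without pre-training and vanilla BERT, especially with formula + context input.
For formula headline generation, the MathBERT-based fusion method achieves better ROUGE/BLEU/METEOR scores than the baselines.
The ablation analysis shows that the OPT and the context contribute differently across tasks: the OPT helps information retrieval, while the context helps topic classification.
The qualitative analysis indicates that MathBERT captures semantic-level structural similarity that goes beyond surface appearance.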
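Since the retrieval result is reported in bpref, a small worked example may help. The sketch below follows the commonly used trec_eval-style definition of bpref. This summary does not say which variant the paper's evaluation uses, and it is an assumption here that the harmonic mean combines two bpref values (for example, one per relevance threshold). The document ids and function names are hypothetical.

```python
from typing import Dict, List


def bpref(ranking: List[str], judgments: Dict[str, bool]) -> float:
    """Binary preference over judged documents only (trec_eval-style variant).

    `judgments` maps a document id to True (relevant) or False (non-relevant).
    Documents missing from `judgments` are unjudged and are skipped.
    """
    relevant_total = sum(1 for rel in judgments.values() if rel)
    nonrelevant_total = sum(1 for rel in judgments.values() if not rel)
    if relevant_total == 0:
        return 0.0
    denom = min(relevant_total, nonrelevant_total)
    nonrel_seen = 0
    score = 0.0
    for doc in ranking:
        if doc not in judgments:
            continue
        if judgments[doc]:
            if nonrel_seen > 0 and denom > 0:
                # Penalize by the judged non-relevant documents ranked above.
                score += 1.0 - min(nonrel_seen, relevant_total) / denom
            else:
                score += 1.0
        else:
            nonrel_seen += 1
    return score / relevant_total


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores; defined as 0.0 when both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]
    judgments = {"d1": True, "d2": False, "d3": True, "d4": False}
    print(bpref(ranking, judgments))  # (1.0 + 0.5) / 2 = 0.75
    print(harmonic_mean(0.5, 1.0))
```

Because unjudged documents are skipped, bpref stays usable when relevance judgments are incomplete.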


This review was generated by AI and checked by a human editor.